PREFACE


This book is intended for students in education, who usually 
have had little training in mathematics. For those who have had 
considerable mathematics the theory of statistics is compara- 
tively easy, but for students without such training the more 
advanced statistical methods offer many difficulties. The pres- 
ent volume supplements the mathematical preparation of the 
student by including sections on such topics as graphing, loga- 
rithms, and elementary theory of probability. The proofs of 
difficult theorems have been omitted throughout and demon- 
strations have been included only when experience has shown 
that they come within the grasp of the ordinary student and 
assist in a clear understanding of the method involved. 

Although no attempt has been made to include all statistical 
methods now used in the field of education, the present text 
treats a somewhat larger number than will be found in most 
elementary books. The chief additions to the usual topics are 
the percentile method, application of the normal curve in cor- 
relating qualitative series, partial and multiple correlation, and 
elementary theory of curve fitting. The important subject of 
index numbers has been omitted entirely because a satisfactory
treatment is beyond the scope of this book. The increasing
need for index numbers in the field of school costs will probably
lead to a separate volume on these methods. 

In order to insure a clear understanding of the statistical 
arithmetic involved in the various methods presented, com- 
plete model problems have been worked out in the text. The 
experience of the writer has been that the ordinary student 
has considerable difficulty in formulating his plans for calcu-
lation and is greatly assisted by detailed arithmetical schemes
for computation, particularly in the early part of the course. A
considerable number of exercises with answers have been added 
at the end of each chapter to clarify the methods discussed and 
to afford the student sufficient arithmetical practice to enable 
him to become accurate in his work. The amount of such prac- 
tice needed varies greatly with students, and enough exercises 
are included to meet the needs of those requiring most drill. 

The material in this volume will be found sufficient for an 
ordinary course of six months, but it may be condensed for a 
shorter course by the omission of certain topics and chapters. 
For an introductory course in a normal school or college, Chap- 
ters I to IX with selected topics from Chapters XII, XIII, 
and XIV are suggested. In case a second course is offered, the 
last seven or eight chapters with supplementary reading and 
term papers will usually be ample. 

The writer is greatly indebted to Professor Karl Pearson,
Dr. Leonard P. Ayres, and Professor Harold Rugg for ideas 
acquired while he was under their instruction. Valuable advice 
and suggestions have also been contributed by Dr. Egon Pear- 
son, Professor C. H. Judd, Professor F. N. Freeman, Professor 
E. R. Breslich, Dr. Douglas Scates, and Dr. Ralph Hogan, all
of whom read the manuscript while in preparation. Additional 
thanks are due to Mr. Lumir Brazda for preparation of the 
diagrams and to Mrs. Bryan Mitchell for assistance in check- 
ing the proof. 

KARL J. HOLZINGER 


STATISTICAL METHODS
FOR STUDENTS IN EDUCATION 


CHAPTER I 
INTRODUCTION 


1. THE NEED FOR STATISTICAL METHOD IN DEALING WITH 
EDUCATIONAL PROBLEMS 


In recent years the scientific movement in education has led 
to the wide use of quantitative methods. Problems in school 
administration and in educational theory and practice are now 
being studied chiefly by the application of experimental and 
statistical technique. 

The increasing demand for school surveys and the generous 
appropriations made by the various foundations to promote 
these and other financial inquiries have created a need for 
statistical training for persons conducting such investigations. 
Some of the outstanding problems in such studies are the ap- 
portionment of school funds, school accounting, unit costs, and 
budgetary control, all of which involve careful accumulation of 
data and application of appropriate statistical method. 

Another field in which adequate knowledge of statistics has 
become imperative is that of standardized tests. In modern 
educational science the old types of personal estimate and school 
examination are being replaced by intelligence tests and scales 
for the measuring of achievement in the various school sub- 
jects. Statistical methods are fundamental in the theory of 
test and scale construction and in the interpretation of the
results obtained from such tests.

In the selection and organization of test material and the 
standardization and preparation in final form, elaborate tech- 
nique is often required. Modern developments in test construc- 
tion have led to the use of more and more refined methods, so 
that the test-maker of today needs to be a thorough student of 
statistics. 

In the application of standardized tests to such problems as 
pupil classification, vocational guidance, diagnosis of special abil- 
ities, and evaluation of methods of instruction, a sound knowl- 
edge of statistical method is imperative, because all such studies 
involve the collection of appropriate data, summarization of 
the results, and correct inferences from the statistical findings. 

The quantitative trend in school investigation has given rise 
to a tremendous bulk of literature. There are now hundreds 
of volumes on school surveys filled with tables and diagrams; 
there are books, monographs, theses, and reports likewise re- 
plete with statistics; there are scores of government, state, and 
institutional pamphlets; there are hundreds of standardized 
tests; and there is an ever-increasing amount of periodical 
literature reporting the findings of quantitative studies. 

It is evident that if the school administrators and teachers 
for whom a large part of this great body of literature was 
written are to understand and apply it, they must have con- 
siderable familiarity with statistical method. It is impossible 
to keep up with the most recent developments in school research 
without some knowledge of the methods upon which such in- 
vestigations are based. 

Professional schools and departments in universities devoted 
to the training of teachers and administrators are meeting the 
demand by courses in experimental and statistical method. 
The purpose of such courses, in general, is to give the student 
sufficient information for intelligent reading of the present 
quantitative literature, and to furnish him with the technique 
necessary for carrying on his own investigations. This twofold 
aim has been kept in mind in preparing the present text. 


2. SOME GENERAL REQUIREMENTS FOR SUCCESS IN THE
USE OF STATISTICAL METHOD 


In conducting a statistical study the investigator, survey ex- 
pert, or classroom teacher should have in mind some definite 
problem or purpose, no matter how limited in scope. The 
mere gathering of masses of data or the haphazard calcula- 
tion and plotting of diagrams are of little value unless they 
can be brought to bear upon a problem. While desirable lines 
of investigation are often discovered after the data have been 
collected and tabulated in a tentative way, it is much safer to 
decide upon the problem first and then proceed to collect the 
data necessary for its solution. The selection of a problem 
which is worth while, and which is sufficiently limited so that 
controls may be made and all necessary details carried out 
thoroughly and completely, is perhaps the most difficult part 
of the whole statistical procedure. It requires wide knowledge 
of the general field in which the problem lies, and a certain 
constructive imagination in foreseeing the various difficulties
which are likely to arise. 

Another requisite for a good statistical investigation is ade- 
quate data. No matter how excellent the problem or the plan 
of procedure, if the data employed are scanty the results will 
be of little value. Statistical method usually involves some 
generalization based upon summaries of the data. If the data 
are small in number, therefore, the conclusions drawn will not
be reliable. This may be illustrated by some unpublished ex- 
periments in maze-learning based upon about twenty-five cases. 
Out of eight similar studies five showed a superiority for one 
method of learning, while the other three showed a difference in 
favor of another method. In all the experiments the number 
of cases was so small that none of the differences obtained 
proved to be significant, but could be readily accounted for 
by mere chance fluctuations in the samples of data chosen. 
While there is no fixed number of cases necessary for making 
a statistical study, a desirable minimum for experimental work
is about fifty, provided they are well chosen. 
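
The part played by chance in such a result may be illustrated by a short calculation. The sketch below is not taken from the experiments themselves. It assumes, purely for illustration, that the two methods are really equal, so that each of the eight studies is as likely to favor one method as the other, and it asks how often a division at least as uneven as five to three would then arise. The function name and its arguments are hypothetical.

```python
from math import comb


def prob_split_at_least(n_studies: int, n_favoring: int) -> float:
    """Chance that a split at least as uneven as the one observed arises
    when each study is an even toss-up between the two methods.

    Assumption of this sketch: studies are independent and each favors
    either method with probability one half.
    """
    # Size of the larger side of the observed split (5 for a 5-to-3 split).
    larger = max(n_favoring, n_studies - n_favoring)
    # Number of outcomes in which one named method wins `larger` or more times.
    one_side = sum(comb(n_studies, k) for k in range(larger, n_studies + 1))
    # Either method may be the one ahead, hence the factor 2 (capped at 1).
    return min(1.0, 2 * one_side / 2 ** n_studies)


p = prob_split_at_least(8, 5)
print(round(p, 3))  # 0.727
```

Under these assumptions a division of five to three or worse would be expected in roughly seven sets of eight studies out of ten, so the split gives no ground for preferring either method.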

Data adequate as to number are not alone sufficient to insure 
satisfactory material. The facts gathered must be reliable and 
pertinent to the problem in hand. Questionnaire returns often 
fail in this respect because the intelligent replies of a number of 
persons to whom the blanks are sent are offset by careless or 
random answers on the part of others. Increasing the bulk of 
such data is not likely to increase its reliability, but the selecting 
of even a smaller number of persons who could be depended 
upon to give careful replies would yield better results. Thus if 
one wished to discover the most important aims in the teaching 
of high-school English, returns from a small well-selected group 
of experienced teachers would be preferable to those from a 
much larger group taken at random. 

It frequently happens that the worker loses sight of the fact 
that his data are inadequate as to quantity and quality and 
applies elaborate statistical methods with the expectation that 
the final results will be of value. Such procedure, if followed 
intentionally, has been rightly described as “hiding behind a 
statistical smoke-screen,” and is nothing less than a scientific
crime. The limitations of the data employed should always be 
frankly recognized and the conclusions of the study made with 
them in mind. No amount of subsequent juggling by compli- 
cated formulas can give good results when they are based upon 
originally faulty data. 

The successful statistician must have the capacity for careful, 
painstaking, and scientifically honest work. It is so easy to 
gather a few figures and tabulate them in such a way as to show 
a desired result or “prove” a certain theory that the tempta- 
tions on the path of scientific rectitude are great. The untrained 
reader is often so bewildered by tables and diagrams that he is 
incapable of verifying the method or the inferences in a statis- 
tical article and either accepts the conclusions on the reputation 
of the writer or perhaps concludes that “anything can be proved 
by statistics.” Educational science would be greatly improved
by the production of a smaller number of studies based upon 
better data and a more cautious use of statistical method. 

A final requisite for the successful use of statistics is training 
in methodology. The investigator needs to become familiar with 
the various technical methods and processes of calculation. He 
needs much training in the application of these methods to data 
and problems in the particular field in which he expects to work. 
He also needs some knowledge of the difficult field of statistical 
inference. It is this general pedagogical requirement which the 
textbook and course in statistics are expected to fulfill. Such a
course of study should familiarize the student with methods 
appropriate to educational problems, insure skill in statistical 
arithmetic, and provide opportunity for working out a worth- 
while problem under careful guidance. 


3. GENERAL STATISTICAL PROCEDURE IN DEALING
WITH A PROBLEM 


While there is no set order in which the steps in a statistical 
study must be carried out, experience has shown that a sys- 
tematic procedure like the following is logical and economical 
of time and labor. Most of these steps will be discussed and 
fully illustrated in subsequent chapters. 


(1) Planning of the study. When the student has some prob- 
lem selected, his first concern will be with a rough plan for the 
whole study. It may not be possible to define the problem very 
specifically until the data have been gathered and examined, 
but the more definitely the limits of the inquiry can be set in 
advance the easier will be the subsequent steps. The usual 
mistake is to select a problem much too broad and too difficult 
for any one individual or even a small group of workers to 
undertake effectively. The availability, sources, accuracy, and 
methods of gathering data should all be considered in the 
preliminary plan. 


(2) Collection of the data. With the problem defined and a
general plan made, the next step is to collect the necessary 
data. This is accomplished by the use of questionnaires, by 
personal tabulation from data already available in records, or 
by the application of standardized tests, rating schemes, and 
other such measuring devices (Chapter II). 


(3) Preliminary analysis of the data. If a questionnaire has
been used in collecting the material, it is usually necessary to 
examine the returns very carefully before making tabulations. 
Incompleteness, inaccuracy, and ambiguity in the answers given 
should all be considered before the data are used. Similar anal-
ysis is often necessary with the results of standardized scales;
unusual test conditions and errors in giving and in scoring the 
tests need to be checked up before tabulation is begun. 

A preliminary analysis of the material will also be desirable 
in many cases to determine whether or not the data are ade-
quate for the problem in hand. It may be that question-blank
returns from a certain source are too scanty or fail to appear in
such a form as to meet the requirements of the problem. In
such cases a revised blank and more data will be required
(Chapter II).


(4) Tabulation for primary records. After the preliminary anal- 
ysis the data should be tabulated in such a way as to form both 
a permanent and a convenient working record. The permanent 
record may be kept in a bound volume with a page to each case
or in the form of a master sheet with the names and records in 
parallel columns. The working record is usually in the form 
of small cards. One of these is made out for each case and the 
data entered in compact form so that the cards may be readily 
sorted and the resulting distributions easily checked (section 7, 
Chapter II).


(5) Classification of the material. Distributions, tables, and 
serial arrangements may next be made from the primary record. 
These furnish the basis for calculations and graphical represen-
tations of the material. 


(6) Analysis of the classified data and planning of the calcula- 
tions. The particular statistical calculations to be employed 
are often not apparent until the data have been arranged in 
systematic form. The choice and right use of the proper ana- 
lytical methods are extremely important, and at this point 
sound statistical judgment is required. 

After the required calculations have been decided upon, they 
should be planned throughout before computation is begun. 
This is particularly advisable with data involving correlations 
(Chapters IX and X), the tables for which may be checked 
against one another and also used to furnish other statistical 
quantities such as the averages and measures of variability 
(Chapters VI and VII). 


(7) Calculation of the statistical constants. The computations 
required may be made with the assistance of calculating tables 
and machines. It is desirable to have complete checks on the 
arithmetical accuracy of the work. Some of these are afforded 
by formulas, but the best check is to have two persons perform 
the calculations independently (Chapter V). 


(8) Interpretation of results. The study has now reached the 
point where a careful scrutiny of results is required. These need 
to be interpreted in terms of the problem in hand. If the in- 
vestigator is fortunate, the results may come out in such a way 
that the conclusions to be drawn are clear-cut and unambiguous. 
Very frequently, however, the findings are incomplete or in- 
conclusive, so that it is necessary to make inferences with 
extreme caution. Careful application of the methods of statis- 
tical inference will then be necessary in order to guard against 
unwarranted generalizations (Chapter XIII). 


(9) Presentation of results in tables and diagrams. Before 
writing the report most workers will find it desirable to prepare 
rough sketches of the tables and diagrams to be used in the
study. It is often convenient to cut these out and to pin them 
into the text as it is written (Chapter III). 


(10) Writing the report. A satisfactory report will usually 
parallel in a general way the steps outlined above. It will 
contain a statement of the problem and its setting in the larger 
field; a description of the group studied; an account of the 
materials and methods employed; the results, inferences, and 
conclusions of the study; and a summary of the results obtained.


With this general plan in mind we may next turn to a detailed 
account of the various statistical methods. 


CHAPTER II
COLLECTION AND CLASSIFICATION OF DATA


1. PRIMARY AND SECONDARY DATA


The raw material employed in statistical studies consists in 
measurements or estimates known as data, which are numerical 
statements of facts in any department of inquiry, such as astron- 
omy, economics, biology, psychology, and education. In the 
last field examples are furnished by the scores of pupils on 
standardized tests, physical measurements of children, salaries 
of teachers, attendance records, etc. 

Data from whatever source may be described as primary or 
secondary. These terms are used in statistical method in much 
the same way as in historical research. In the latter field a fact 
taken from an ordinary text is considered as secondary material 
because it is removed at least one step from the original record. 
If the information were secured first-hand from documentary 
sources such as laws, original proceedings, letters, etc., it would 
be considered as primary historical data. 

In the case of statistical method, primary data may be de- 
scribed as those secured from questionnaires, measurements, or 
estimates before the material has been combined or treated in 
any way so as to obscure the units or method of collection. 
Secondary data, on the other hand, are those which have al- 
ready been collected and tabulated in some form available for 
use. They are usually removed one or more steps from the form 
of the original record, and hence comparison with similar ma- 
terial is of doubtful significance. 

If the problem were to determine the academic training of 
teachers beyond four years of high school, primary records
might consist of returns from a large sampling of individual
teachers’ replies to a question blank. Secondary data for such
a problem could be secured from the reports of state superin- 
tendents. The latter type of material would be relatively easy 
to obtain, but would be open to the objection that the types 
of teachers, units of tabulation, and other factors might not be 
comparable in the various reports. 

Primary data are of course much to be preferred to secondary 
material. In case the study is of wide scope, however, and re- 
quires an elaborate plan for securing the facts, the work of 
collection will usually be too much for a single individual. Such 
studies are often subsidized by grants from public and private 
funds so that a staff of trained workers may gather the material. 
Assistance of this sort is particularly necessary in the field of 
school costs,* where differences in methods of accounting re- 
quire personal tabulation of the data directly from the school 
records and invoices. 

Studies which involve primary data and which may be effec- 
tively handled by a single person include experiments with 
apparatus or standardized tests, questionnaire investigations of 
limited scope, and intensive problems where the method of 
personal estimate or observation is required. 


2. SOME EXAMPLES OF SECONDARY SOURCE MATERIAL 


The student who wishes to employ secondary material will 
find a large amount in government reports, school surveys, 
foundation publications, and funded inquiries. The Federal 
sources include the annual and sundry reports of the United 
States Bureaus of Census, Education, and Labor. Dr. Leonard 
P. Ayres made extended use of such material in preparing his 
volume, “An Index Number for State School Systems,” from
Bureau of Education reports.† He was able to obtain figures on
school costs and attendance for all the states running back
over a period of fifty years.


* See N. B. Henry, A Study of Public School Costs in Illinois Cities. The Mac-
millan Company, 1924. (This is one of the studies subsidized by the Common-
wealth Fund.)

† Leonard P. Ayres, An Index Number for State School Systems. Russell Sage
Foundation, 1920.

Reports from state superintendents and state departments
are often useful in making preliminary studies of a type re- 
ported by William R. Burgess on the academic preparation of 
teachers.* Dr. Burgess summarized the reports from fourteen 
states and found that the average teacher in 1918 had only 
one and one-quarter years of training beyond high school. 

School surveys furnish a very valuable source of compara- 
tive data, but the variations in the methods employed for secur- 
ing the financial and test data make extreme caution necessary 
in using such facts. The volume of the Educational Finance 
Inquiry on “Financial Statistics,” prepared by Miss Mabel
Newcomer,† is another example of a useful compilation for
comparative purposes. Similar studies may be found by con-
sulting the extensive bibliography on school costs prepared by
Dr. Carter Alexander.‡


3. UNITS OF COLLECTION 


In gathering statistical data it is usually necessary to decide 
in advance upon the units to be employed. For Dr. Burgess’s 
problem, cited above, the character, “teacher training beyond
high school,” might have been expressed in a variety of units 
such as semester hours, quarters, semesters, or years. In deal- 
ing with normal-school and college training it seemed advisable 
to him to consider a year of training as the unit no matter 
where taken. The choice of such a crude unit of course makes 
fine comparisons of doubtful significance. Two years of train-
ing in a very poor normal school are not equivalent to two
years in a first-class institution. The decision as to how rough
a unit may be employed will depend largely upon the purpose
of the study. Dr. Burgess was interested not in individual dif-
ferences in teacher training but in securing an approximate
index of the amount of such training in a whole state. For such
purposes the unit employed was a very reasonable one.


* W. R. Burgess, “The Education of Teachers in Fourteen States,” Journal of
Educational Research, March, 1921.

† Publications of the Educational Finance Inquiry, Vol. VI. The Macmillan
Company, 1924.

‡ Volume IV of the Educational Finance Inquiry. The Macmillan Company, 1924.

With test data the units of collection are given by the tests 
themselves in terms of points, years of mental or educational 
age, or as functions of group variability (see Chapter VII). In 
recent years it has been discovered that the units in many
earlier scales were expressed to a fictitious degree of accuracy.
Problems in a “scaled” series were assigned values such as 3.24
under the assumption that abilities could be measured with an
accuracy of one-hundredth of a “probable error” unit, as it is
called. The instability of mental characters makes such preci-
sion unwarranted. Another measurement of the same person
would probably differ from the previous one by a whole unit of
“probable error.” For most test material the simple unweighted
item furnishes a unit which is sufficiently accurate for all statis- 
tical purposes, although derived scores such as mental or educa- 
tional ages are often very convenient. 

In the case of stable characters, such as height, greater care is 
needed in determining the unit of measurement to be employed. 
The classification of the data and the comparisons which follow 
will depend upon the degree of accuracy in the original material. 
It is usually best to make the measurements somewhat finer 
than the unit to be employed later in grouping the data. Thus 
if heights are to be classified in one-inch intervals the measure- 
ments might be made to the nearest quarter or eighth of an 
inch. This insures a fairly even distribution of the observations 
over the intervals as shown in section 8. 
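
The grouping of fine measurements into coarser intervals may be sketched as follows. The heights are invented for illustration, and the convention that each one-inch interval runs from a whole inch up to, but not including, the next whole inch is an assumption of this sketch and not necessarily the convention of the text. The function name is likewise hypothetical.

```python
from collections import Counter
from math import floor


def group_into_intervals(measurements, width=1.0):
    """Count the measurements falling in each interval [k*width, (k+1)*width).

    Returns a dict mapping the lower limit of each interval to its count,
    ordered from the lowest interval to the highest.
    """
    counts = Counter(floor(m / width) * width for m in measurements)
    return dict(sorted(counts.items()))


# Hypothetical heights in inches, recorded to the nearest quarter inch.
heights = [60.25, 60.75, 61.0, 61.5, 61.75, 62.25, 62.5, 63.0, 61.25, 62.0]

for lower, count in group_into_intervals(heights).items():
    print(f"{lower:.0f} to {lower + 1:.0f} inches: {count}")
```

Because the readings are finer than the interval, each one falls clearly within a single interval except those recorded exactly at a whole inch, which the stated convention assigns to the interval beginning at that inch.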


4. TYPES OF SERIES 


A statistical series may be described as a set of data the items 
of which have some common feature, or character. Examples of 
widely different characters are height, intelligence, and religious 
denomination. When measurements of height, estimates of
intelligence, or verbal descriptions of religious denomination 
are made, the resulting data may be regarded as a statistical 
series. It is the function of statistics to summarize, compare, 
and draw inferences from such series relative to some problem. 

In order that appropriate methods may be used with different 
sorts of data, it will be desirable to have a classification of the 
various types of statistical series which may arise. The basis 
of this classification is in the representation of the character 
as next described. 

The three characters cited above differ in certain important 
respects. Height and intelligence are termed ordered characters, 
because the amount or degree of either trait may vary in an 
orderly manner. Religious denomination, on the other hand, 
does not lend itself to such gradation, but must be described in 
verbal categories not related to one another in any orderly 
fashion; that is, it would be indifferent whether “Congrega-
tional” be placed before or after “Methodist” in a classification.
Such characters may be called unordered.

A second difference between the above characters arises from 
the manner in which the amounts, degrees, or categories are 
described, or, more briefly, according to the mode of indexing 
of the character. These modes of indexing may be numerical 
or verbal. Thus height is said to be numerically indexed when
various amounts for different individuals are expressed by the
numbers arising from some scale, that is, by measurements.
Intelligence may be indexed numerically or verbally. In the 
former case mental ages or intelligence quotients could be em-
ployed, while, for the latter mode of indexing, verbal descrip-
tions, as “inferior,” “medium,” and “high,” might be used. The
limits of these categories may or may not be stated in numerical 
terms according to the manner in which the data are gathered. 
As already noted, the three categories of intelligence can be 
arranged in an orderly fashion, but in the case of religious de- 
nomination the verbal categories could be placed in any order. 

A third distinction may be made between continuous and dis-
continuous characters. For the former the unit of measurement
may be made as fine as we please, while determinations for the 
latter type must always be given in integers (whole numbers). 
Thus height is a continuous character, because gradations of 
height vary continuously; but size of class would be regarded as a 
discontinuous character, since fractional class sizes cannot occur. 

The preceding analysis of characters may now be used to de- 
termine a classification of statistical series, the various types 
depending upon the ordering and indexing of the characters 
involved. Thus when the character is ordered and numerically 
indexed the resulting series will be called quantitative; when the
character is ordered and verbally indexed the series will be
termed qualitative; whereas for an unordered and verbally in-
dexed character the series will be designated as unordered.
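
The rule just stated may be written out as a short sketch. The function name and the labels "numerical" and "verbal" are hypothetical, chosen only for this illustration. Treating every unordered character as giving an unordered series, whatever its indexing, is an assumption covering a case the paragraph does not mention.

```python
def classify_series(ordered: bool, indexing: str) -> str:
    """Return the type of series arising from a character.

    ordered  -- True if the amounts or degrees can be arranged in order
    indexing -- "numerical" or "verbal" (labels assumed for this sketch)
    """
    if indexing not in ("numerical", "verbal"):
        raise ValueError("indexing must be 'numerical' or 'verbal'")
    if not ordered:
        # Assumption: an unordered character gives an unordered series
        # regardless of how its categories are labeled.
        return "unordered"
    return "quantitative" if indexing == "numerical" else "qualitative"


examples = {
    "height in inches": (True, "numerical"),
    "intelligence as inferior, medium, high": (True, "verbal"),
    "religious denomination": (False, "verbal"),
}
for name, (ordered, indexing) in examples.items():
    print(f"{name}: {classify_series(ordered, indexing)}")
```

The three examples are the characters discussed above and yield a quantitative, a qualitative, and an unordered series respectively.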

It should be noted that the basis of the present classifica- 
tion is in the ordering and indexing of the character and not 
in the nature of the trait itself. Thus both speed and quality 
of handwriting may furnish either quantitative or qualitative 
series because both may be either measured or verbally de-
scribed. An objection may be made to the use of the term
“qualitative” to describe series where the trait itself is not a
quality in the ordinary sense. This term, however, seems a 
convenient one for verbal and ordered characters, and no con- 
fusion need arise if it be understood that “qualitative” is merely 
a label for such series. 


Table 1 gives a brief summary of the above types of series.


TABLE 1. CLASSIFICATION OF SERIES *

ORDERING AND INDEXING       RESULTING STATISTICAL
OF THE CHARACTER            SERIES                   EXAMPLE

Ordered
  Indexed numerically       Quantitative
                              Continuous             Test scores
                              Discontinuous          Size of classes
  Indexed verbally          Qualitative              Estimates of intelligence
Unordered
  Indexed verbally          Unordered                Classification of religion,
  (possibly numerically)                             race, occupation, etc.


5. METHODS OF COLLECTING DATA


If secondary data are employed in a statistical study, it is
only necessary to assemble the material from the records in
some form convenient for use. Primary data may also be col-
lected by tabulation of existing records, but they are usually
gathered by enumeration or counting in such problems as school
census, by estimation as in experimental work and the appraisal
of teaching efficiency, by measurement with physical and mental
scales, or by questionnaires with inquiries of broad scope.

The particular method of collection to be employed will de- 
pend upon the problem and the availability of the data. It is 
usually best to avoid such indirect methods for securing data 
as the questionnaire when it is to be filled in from printed direc- 
tions. The dangers of securing incomplete, unrepresentative, 
and faulty information from such sources are very great. 

In collecting data by enumeration, estimate, or measurement 
it is desirable that the work be done by trained persons and by 
uniform methods. For a problem in child accounting, knowl-
edge of the terms and methods in this field would be necessary
before comparable data could be collected from various sources.
In the appraisal of methods of classroom instruction a uniform
system and a careful technique of observation would be re-
quired, while for problems involving the use of standardized
tests the service of trained workers in the administration of
such scales is usually needed. No elaborate statistical treatment
can correct the faults of poor original data, and anything that can
be done therefore to improve the reliability and accuracy of the
material is time well spent.


* In his “Introduction to Statistics” Mr. G. U. Yule distinguishes between
statistics of attributes and statistics of variables. For the former the observer notes
only the presence or absence of some attribute and counts the number of individuals
who do or do not possess it; for the latter type, determinations of some variable
are made. Examples given for statistics of attributes are the number of blind and
not blind, sane and insane, or tall and short persons. Measurements of height or
weight furnish the data for statistics of variables.

The twofold classification given by attributes is rather restrictive and leaves open
to doubt the designation of many series which may arise. Thus if another group
“medium” be added to “tall” and “short,” we are at a loss to know whether the
resulting series is to be classified under attributes or variables. Again, if we consider
the disabilities “blindness,” “deafness,” and “insanity” the same question arises.
It appears more satisfactory, therefore, to consider the above series given by height
as qualitative, since this character is ordered and verbally indexed, and to designate
the disability series as unordered on account of the indifferent arrangement of the
three classes.


6. SAMPLING 


By the sampling process is meant the use of a sample or portion
of a larger universe of material taken for the purpose of
drawing conclusions as to the whole. Thus if age norms are to
be prepared for a certain test it is clearly impossible to examine 
all children of the required ages. It is therefore necessary to 
base the averages upon representative samples taken from the 
larger universe or population. If the samples are fairly large 
and properly chosen, the results will not only be very close 
to those which would have been obtained from the whole 
population, but it is also possible to predict from the sample 
the range within which the true value will very probably lie 
(see Chapter XIII). This makes it possible for the statisti- 
cian to generalize beyond his actual data, and to express the 
so-called “reliability” of his result in terms of mathematical
probability. 

The principle behind the sampling process is that a fairly 
large number of items chosen at random from a large group or 
population is very likely to have the characteristics of the whole 
population. This may be called the Law of Statistical Regularity 
for Large Numbers. 
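
The law may be tried out by simulation. The following Python sketch is an illustration only: the population of 10,000 items, the 21 per cent share of marked items, the sample sizes, and the name sample_percentage are all assumptions made for the example and are not data from the text.

```python
import random


def sample_percentage(population, size, rng):
    """Percentage of marked items (value 1) in a random sample drawn without replacement."""
    sample = rng.sample(population, size)
    return 100.0 * sum(sample) / size


# Hypothetical population: 10,000 items, 21 per cent of which are marked 1.
population = [1] * 2100 + [0] * 7900
rng = random.Random(1)  # fixed seed so that a run can be repeated

for size in (25, 100, 400, 1600):
    # Draw 200 independent samples of this size and see how far they scatter.
    percentages = [sample_percentage(population, size, rng) for _ in range(200)]
    spread = max(percentages) - min(percentages)
    print(f"sample size {size:5d}: spread of 200 sample percentages = {spread:.1f}")
```

The spread of the sample percentages about the population value should be seen to narrow as the samples grow larger, which is what the law asserts.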

A simple illustration of this law is furnished by an experiment 
to determine the percentage of Ford cars appearing on a
south-side boulevard in Chicago. The results found were typical of
what might be considered a universe of Fords frequenting that
part of the city. The method adopted was to count 100 passing
cars and note the number of Fords in such a sample, with the
following record:


TABLE 2. DATA ON FORD EXPERIMENT


PERCENTAGE OF FORDS OBSERVED       NUMBER OF SAMPLES OF 100 CARS
IN ANY ONE EVENING'S SAMPLE        EACH FOR A GIVEN PERCENTAGE
OF 100 CARS                        OF FORDS

28
27
26
25
24
23
22
21
20
19
18
17

Total                              23

[The number of samples at each percentage is illegible in this copy.]


The average and most frequent percentage of Fords was 21, 
with a maximum variation in the other samples of only 7 per 
cent. Any one of the twenty-three samples then gives a fairly 
good indication of the required percentage. It must be kept 
in mind, however, that the above results were for only one 
section of Chicago during a certain time of the day and for only 
one season of the year. Three samples taken on the north side 
two months later gave a percentage of only eight. 

Another example illustrating the law of regularity in sampling 
is furnished by Mr. Ben Wood.* Cards for 6468 boys were filled 
with information regarding guardianship, number of children 
in the home, and other similar data. By putting the cards in 
alphabetical order and selecting every fourth one Mr. Wood 
was able to secure a sample the characteristics of which were 
in remarkably close agreement with those for the whole group.
A modified portion of his tables shows that a quarter of the
cards, chosen as they were, was an adequate sample for comparative
purposes.


*Ben Wood, “The Reliability of Prediction of Proportions on the Basis of
Random Sampling,” Journal of Educational Research, December, 1921.


TABLE 3. PER CENT OF BOYS LIVING UNDER VARIOUS HOME CONDITIONS


                                         PORTION OF 6468 CARDS USED
ITEM                                 One Fourth   Three Fourths    All

I. Guardian
     Father                             83.4          82.4         82.4
     Mother                             13.3          14.1         13.9
     Uncle                               0.6           0.6          0.6
     Aunt                                0.4           0.2          0.2
     Stepfather                          0.7           0.9          0.9
     Stepmother                          0.2           0.1          0.2
II. Number of children in family
     One                                 6.0           6.3          6.3
     Two                                11.3          11.8      [illegible]
     Three                              14.8          13.7         13.9
     Four                               13.6          14.4         14.2
     Five                               14.3          14.6         14.5
     Six                                11.9          12.6         12.4
     Seven                               9.8          10.5         10.3


In securing a random sample the principle to be kept in mind 
is that every individual in the group should have the same (or 
nearly the same) chance of being included in the sample. This 
is accomplished in several ways. One plan is to mix the data 
very thoroughly and then take a limited portion of them. This
procedure is exemplified in the shuffling and dealing in ordinary
card-playing. The purpose of the mixing or shuffling is to pro- 
duce what is called a random distribution, a portion of which 
furnishes the random sample. Such distributions are assumed 
to be already existent in many problems, such as that of the 
motor cars, where the arrangement of a given one hundred cars 
was affected by many chance factors. The same assumption is 
made in measuring rainfall. Although the drops fall unevenly 
they tend to moisten given areas equally in the long run, and 
hence a gauge of a certain area furnishes a random sample. It 
is, of course, true that for large cities such as Chicago, samples
from different parts of the city need to be taken in order to get
an adequate measure of the rainfall for the whole city. On 
numerous occasions it has rained in one part of the city and 
not in others at the same time. 

Good results may often be secured by taking the items at 
regular intervals after the material has been arranged in some 
order. In Mr. Wood’s experiment every fourth card was se- 
lected under alphabetical arrangement. This plan is usually 
satisfactory unless there is some reason to expect a relationship 
between the character studied and alphabetical order. Thus, in 
a study involving pupil recitation there might be a tendency 
on the part of some teachers to call more frequently on pupils 
whose names begin with the earlier letters of the alphabet. 
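
The plan of taking items at regular intervals can be sketched in Python as follows. The card file, the surnames, and the function name systematic_sample are hypothetical and serve only to show the procedure; they are not Mr. Wood's data.

```python
def systematic_sample(items, step, start=0):
    """Return every `step`-th item of an ordered list, beginning at index `start`."""
    if step < 1 or not 0 <= start < step:
        raise ValueError("step must be >= 1 and 0 <= start < step")
    return items[start::step]


# Hypothetical card file: (surname, number of children in the home).
cards = [("Adams", 3), ("Baker", 1), ("Clark", 4), ("Davis", 2),
         ("Evans", 5), ("Frank", 2), ("Gray", 3), ("Hill", 4),
         ("Irwin", 1), ("Jones", 6), ("King", 3), ("Lewis", 2)]

ordered = sorted(cards)                  # alphabetical arrangement
sample = systematic_sample(ordered, 4)   # every fourth card

mean_all = sum(n for _, n in ordered) / len(ordered)
mean_sample = sum(n for _, n in sample) / len(sample)
print(sample)
print("whole file:", mean_all, " sample:", mean_sample)
```

The starting index is left as a parameter because choosing it at random is a common way of giving each card nearly the same chance of selection.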

If the population sampled contains a number of types, a 
purely random sample of the whole is probably not best be- 
cause some of the types may be omitted or not fairly repre- 
sented. For such problems sub-samples proportional in size to 
the numbers in the various types should be selected. For ex- 
ample, in a study of high-school pupils samples from each of 
the four years might be chosen and combined, the size of the 
samples being taken proportional to the relative numbers of 
pupils in the four high-school classes. 
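
A minimal sketch of proportional sub-sampling is given below. The enrollment figures, the total sample of 50, and the function names are assumptions made for illustration; the rounding rule (largest remainders first) is one reasonable choice among several and is not prescribed by the text.

```python
import random


def proportional_allocation(stratum_sizes, total_sample):
    """Divide total_sample among the types in proportion to their sizes."""
    population = sum(stratum_sizes.values())
    exact = {k: total_sample * n / population for k, n in stratum_sizes.items()}
    alloc = {k: int(v) for k, v in exact.items()}
    leftover = total_sample - sum(alloc.values())
    # Places lost in rounding down go to the largest fractional parts.
    by_fraction = sorted(exact, key=lambda k: exact[k] - alloc[k], reverse=True)
    for k in by_fraction[:leftover]:
        alloc[k] += 1
    return alloc


def stratified_sample(strata, total_sample, rng):
    """Draw a random sub-sample from each type, sized by proportional allocation."""
    sizes = {k: len(v) for k, v in strata.items()}
    alloc = proportional_allocation(sizes, total_sample)
    return {k: rng.sample(strata[k], alloc[k]) for k in strata}


# Hypothetical high school: pupils are represented by serial numbers.
strata = {
    "first year": list(range(400)),
    "second year": list(range(300)),
    "third year": list(range(200)),
    "fourth year": list(range(100)),
}
chosen = stratified_sample(strata, 50, random.Random(3))
print({k: len(v) for k, v in chosen.items()})
```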

The size of the sample will depend upon the degree of accu- 
racy required in the result, the precision varying as the square 
root of the number of cases. As indicated in the first chapter, 
forty to sixty cases are as few as can be expected to yield good 
results in experimental work. When only fifteen or twenty are 
used, the application of the usual laws of sampling becomes 
very doubtful. 
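
The square-root rule stated above has a practical consequence that a short calculation makes plain: to double the precision the number of cases must be quadrupled. The Python sketch below uses only that rule; the base of fifty cases and the function names are assumptions chosen for illustration.

```python
import math


def relative_precision(n_cases, base_cases=50):
    """Precision relative to a sample of base_cases, taking precision as proportional to sqrt(n)."""
    return math.sqrt(n_cases / base_cases)


def cases_needed(current_cases, gain):
    """Cases required to multiply the precision by `gain` under the square-root rule."""
    return math.ceil(current_cases * gain ** 2)


for n in (15, 50, 200, 800):
    print(n, "cases: precision relative to 50 cases =", round(relative_precision(n), 2))

print("cases needed to double the precision of 50 cases:", cases_needed(50, 2))
```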


7. ARRANGEMENT OF THE ORIGINAL DATA 


The form of the permanent and working records will depend 
upon the number of data employed. The master sheet, as shown 
in Exercise 1 at the end of this chapter, is advisable for samples 
of fifty to one hundred cases. With a large amount of material,
however, a uniform blank card or page is usually required. Sir
Francis Galton kept a record of his data in large bound vol- 
umes with a page for each person examined, the age, profession, 
nationality, and the results of various mental and physical tests 
being set down in the appropriate spaces. 

If the series is short, the tallying and distributions may be 
made directly from the master sheet by running down the page 
and checking off the items. This method, however, makes it 
necessary to go over the whole list to catch a single error and 
is rather awkward for the preparation of correlation tables
because the order and accuracy of entry produce a strain on the
attention.

[Fig. 1. Data Card]

The preparation of small tickets for a working record will
overcome most of the above difficulties. These cards should be
fairly thin, of uniform size, and have a corner cut off to facilitate
the separation into piles during the sorting. The data are set
down from the permanent record in the form of numbers with
a definite spatial arrangement on the card. In order to identify
the tickets with the permanent record the case number should 
appear on a corner of the card. This will make it possible to 
prepare a duplicate in case a card is lost. The size of the card 
will depend upon the number of items entered, but it should be 
as small as can be handled conveniently. A key for the items 
entered will of course be required for a sample card such as the 
one shown in Fig. 1. In sorting, the characters are easily iden- 
tified by their position on the card. 

In case the work of tabulation and sorting is to be done by 
mechanical devices such as the Hollerith Machine, the data 
card will be a convenient record when punching the informa- 
tion for the tabulating card. The holes in this card (Fig. 2) 
make it possible to sort very rapidly by electrical contact.


8. THE SIMPLE FREQUENCY DISTRIBUTION


In dealing with a large body of data it is necessary to classify 
the material in some compact and orderly form before it can be 
effectively analyzed. The frequency distribution is the most 
convenient arrangement for the material, because it reveals 
some of the most important properties at a glance and makes 
all of the calculations very much easier than would be possible 
with the ungrouped items. A simple frequency distribution con- 
sists of a series of classes of the character and a set of correspond- 
ing frequencies. In the case of a quantitative series the scale is 
usually divided into a number of classes of equal width, for example,
54.5 to 59.5, 59.5 to 64.5, 64.5 to 69.5, etc. The number
of items or measures (called the frequency) occurring in each 
interval is then determined by tallying. For qualitative or 
unordered series the classes are indicated verbally and the 
frequencies tabulated as in the first case. 

The ancient method of tallying is to record the frequencies by 
strokes until four have been made and then to make a cross 
stroke. This makes it easy to count the marks. For example: 


CLASS              TALLY                 FREQUENCY
64.5-69.5          ////\ /                    6
59.5-64.5          ////\ ////\ /             11
54.5-59.5          ////\ //                   7


The tally marks, of course, should not appear in the final 
distribution. 

The steps in making a frequency distribution for a quantita- 
tive series consist in (1) noting the range of the data, that is, 
the distance between smallest and largest items; (2) deciding 
upon the number of classes into which the material is to be 
grouped; (3) determining the numerical limits of the classes;
and (4) tallying the frequencies in the appropriate classes.
Steps (2) and (3) are important because all of the subsequent
calculations will be affected by the width and limits of the
classes. The data, when grouped, are considered to be either
concentrated at the midpoints of the intervals or spread evenly 
over them. The calculations from grouped data will then not 
agree exactly with those from ungrouped series unless the width 
of the classes is equal to the collection unit. If the grouping 
is this fine, however, the classes may be so numerous that the 
advantage of employing a distribution is lost and the frequencies 
are likely to present a very irregular appearance, not typical of 
the continuous gradation expected from ordered characters. It 
is therefore better to use a wider interval smoothing out the 
accidental irregularities, probably due to sampling, and making 
the subsequent calculation easier although slightly less accurate. 
When there are from fifteen to twenty-five classes with material 
consisting of one hundred or more items, the error due to group- 
ing is very slight, and even this may be adjusted by certain 
corrections (see Chapter XVI, section 8). 
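
The four steps may be sketched in Python as follows. The twenty scores, the lowest limit of 54.5, and the width of 5 are hypothetical values chosen to match the class limits used as examples in this section; the function name is likewise an assumption.

```python
import math


def frequency_distribution(scores, lowest_limit, width):
    """Tally scores into classes of equal width beginning at lowest_limit.

    Returns a list of (lower_limit, upper_limit, frequency), lowest class first.
    """
    if min(scores) < lowest_limit:
        raise ValueError("lowest_limit lies above the smallest score")
    n_classes = math.floor((max(scores) - lowest_limit) / width) + 1
    counts = [0] * n_classes
    for s in scores:
        counts[math.floor((s - lowest_limit) / width)] += 1
    return [(lowest_limit + i * width, lowest_limit + (i + 1) * width, c)
            for i, c in enumerate(counts)]


# Hypothetical scores recorded to the nearest integer.
scores = [55, 57, 58, 58, 59, 60, 61, 61, 62, 62,
          62, 63, 64, 64, 65, 66, 66, 67, 68, 69]

print("range:", max(scores) - min(scores))           # step (1)
table = frequency_distribution(scores, 54.5, 5)      # steps (2) to (4)
for low, high, f in table:
    print(f"{low}-{high}   {f}")
```

Because the limits fall on half units while the scores are whole numbers, no score can lie exactly on a class limit, so the assignment of each item is unambiguous.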

The choice of class limits depends upon the accuracy of the 
original data. If the measurements are very fine and the classes 
fairly broad, the limits of the classes may be expressed in the 
form 55-59.99, 60-64.99, 65-69.99, etc. This method makes it
possible to assign measurements very definitely to the appro- 
priate classes, since all items equal to the lower limit and up to 
but not including the upper limit are located in a given class. 
One difficulty with this designation, however, is that confusion 
sometimes arises regarding the numerical value of the upper
class limits in calculation. Students may take these to be actually
59.99, 64.99, etc. An alternative plan is to write the classes
in the form 55-60-, 60-65-, 65-70-, etc., with the understanding
that 60- is equal to 60 for purposes of calculation, but
means just less than 60 in the tabulation. 

A more important objection to the above method arises when 
the measurements are not very fine. If we assume that the 
items are given correct to the nearest integer, an even distribution
of the observations over a class interval would be represented
as shown in Fig. 3, p. 24.

The class values, or midpoints of the intervals used in subsequent
calculations, will thus be 57.5, 62.5, etc. These values,
however, are not representative of the items in the respective
classes. In the first interval, for example, there are three
items below 57.5 and only two above, spacing the observations
unequally about this class value.

[Fig. 3. Illustrating the incorrect class limits which appear when the items are
measured to the nearest integer]

In order to adjust for this difficulty, the class limits
should be set as in the diagram shown in Fig. 4 by moving one
half of the collection unit down on the scale. The location of any
frequency is as uniquely determined as before, and
the class values 57, 62, etc. are more truly representative
of the intervals 54.5-59.5, 59.5-64.5, etc.

[Fig. 4. Illustrating the correct method of stating class limits when the items are
given to the nearest integer]

If the measurements are so fine that practically the
whole interval 55-59.99 could be filled with observations, then
57.5 would be an approximately correct class value and the
limits 55-60- satisfactory. If, on the other hand, the measurements
are rather coarse, the size of the collection unit needs
to be taken into account by moving the interval back one half
the amount of this unit. If this is not done, all class values
(and the resulting average for the whole series) will be one
half of the collection unit too large.

[Fig. 5. Illustrating class values and limits with measurements: a, very fine;
b, to nearest integer]
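
The size of this error can be checked by a small calculation. In the Python sketch below the ten scores (one at each integer from 55 to 64) and the function name grouped_mean are hypothetical; the collection unit is taken as 1.

```python
def grouped_mean(scores, lowest_limit, width):
    """Mean obtained when every score is replaced by the midpoint of its class."""
    total = 0.0
    for s in scores:
        index = int((s - lowest_limit) // width)
        midpoint = lowest_limit + (index + 0.5) * width
        total += midpoint
    return total / len(scores)


# Hypothetical scores, given to the nearest integer: one at each value 55 ... 64.
scores = list(range(55, 65))

true_mean = sum(scores) / len(scores)
shifted = grouped_mean(scores, 54.5, 5)    # classes 54.5-59.5, 59.5-64.5
unshifted = grouped_mean(scores, 55, 5)    # classes 55-60-, 60-65-

print("mean of the ungrouped scores:      ", true_mean)
print("grouped mean, limits 54.5-59.5 etc.:", shifted)
print("grouped mean, limits 55-60- etc.:   ", unshifted)
print("error of the second form:           ", unshifted - true_mean)
```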

The following frequency distribution has been made from the 
Otis Test Scores appearing in Exercise 1 at the end of this 
chapter. Inasmuch as the scores are given to the nearest integer
(point), the classes will run from 79.5-89.5, 89.5-99.5, etc.


TABLE 4. FREQUENCY DISTRIBUTION FOR OTIS TEST SCORES


CLASS              TALLY                 FREQUENCY

179.5-189.5        /                          1
169.5-179.5        /                          1
159.5-169.5        ////                       4
149.5-159.5        ////\ ////\ /             11
139.5-149.5        ////\ ////                 9
129.5-139.5        ////\ ////\ /             11
119.5-129.5        ////\                      5
109.5-119.5        ////                       4
 99.5-109.5        //                         2
 89.5- 99.5        /                          1
 79.5- 89.5        /                          1
Total                                        50


It might be argued that a person receiving a score of 80 could 
not have done less than the amount required to receive such a 
score, and that he very probably did a little more, so that his 
truer score for that performance should be 80.5 instead of 80. 
This reasoning would lead to class intervals, 80-90-, 90-100-,
etc., but would be contrary to the usual practice of taking scores 
at their face value. In the present discussion, therefore, we 
shall assume that scores are correct to the nearest integer. 
It will be noted that only eleven classes were used in this 
distribution because of the small number of cases involved. 


9. THE CLASSIFIER 


In dealing with small samples it is frequently desirable to 
rank the items and to prepare short frequency distributions. 
For such purposes a device known as the classifier will be found
very convenient. It consists of tabular arrays of small cells
identified by units’ digits on one axis, and by tens’ digits on the 
other. The location of any item is then readily indicated by a 
tally mark in the appropriate cell. The accompanying classifier 
has been made for the Otis scores given in Exercise 1, p. 29. 


TABLE 5. CLASSIFIER FOR OTIS TEST SCORES * 


                             UNITS
TENS                                                      TOTALS
          0    1    2    3    4    5    6    7    8    9


* This useful device was first brought to the attention of the writer by Dr. 
Leonard P. Ayres in a series of lectures given at The University of Chicago in 1920. 
It is recommended for use when dealing with fifty to one hundred cases.

The scores are entered in the classifier just as they come from
the master sheet. Thus the first Otis score from the list is 171. 
Looking down the left-hand margin for the tens’ digit 17, and 
moving across under the units’ digit 1, locates the tally in the 
proper cell. To check the work the tallying may be repeated 
by making a small dot over each tally stroke. 

It will be noted that the material has been arranged in classes 
ten units in width and that the distribution in the totals on the 
right is the same as that found in section 8. The distribution of 
the totals at the bottom of the classifier is a random arrange- 
ment, the number of scores ending in 0, 1, 2, 3, etc. tending to 
be the same in the long run. 

The numbers in the cells indicate the ranks of the various 
items, and are determined after all of the tallying is completed, 
by counting down from the highest score. The advantage of the 
classifier for ranking is that if the tallying is correct, none of the 
scores will be omitted as would be quite likely if they were 
arranged in rank order by searching in the list of fifty for suc- 
cessively smaller items. 

When a score of 152 has been reached in the ranking, three 
tallies will be found. Inasmuch as these have the same value 
it is customary to assign to each the average rank of 11, 12,
and 13, which is 12. In the same way the two scores of 151
would share the next two ranks 14 and 15, each being given the 
average rank of 14.5. 
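
The rule for tied scores may be expressed as a short Python sketch. The function name is an assumption, and the fifteen scores are the Otis scores of 151 and above as read from the table of scores in Exercise 1, so they should be checked against that table before being relied upon.

```python
def average_ranks(scores):
    """Rank scores from the highest (rank 1) down; tied scores share the mean of their ranks."""
    ordered = sorted(scores, reverse=True)
    ranks = {}
    position = 0
    while position < len(ordered):
        value = ordered[position]
        ties = ordered.count(value)
        # Ranks position+1 through position+ties are shared equally.
        ranks[value] = (2 * position + ties + 1) / 2
        position += ties
    return ranks


# Otis scores of 151 and above, as read from the table in Exercise 1.
top_scores = [187, 171, 169, 167, 166, 161, 159, 158, 157, 153,
              152, 152, 152, 151, 151]

ranks = average_ranks(top_scores)
print("rank of 152:", ranks[152])
print("rank of 151:", ranks[151])
```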

In addition to the grouping and ranking of the data the classi- 
fier will also be found useful in determining the median. This 
average for ranked items is the middle score, or is halfway be- 
tween the two middle scores, for an even number of items. In 
the problem above the median is 140.5 by inspection. 
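
A minimal Python sketch of this definition follows. The two short lists are hypothetical, and the function is written out in full only to show the rule; Python's standard statistics.median follows the same convention for an even number of items.

```python
def median(scores):
    """Middle score of the ranked items, or halfway between the two middle scores."""
    ordered = sorted(scores)
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        return ordered[middle]
    return (ordered[middle - 1] + ordered[middle]) / 2


odd_case = median([137, 145, 140])          # three items: the middle one
even_case = median([140, 141, 137, 145])    # four items: halfway between 140 and 141
print(odd_case, even_case)
```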

It will be noted that the median is here defined as the middle 
score. For an odd number of cases this definition offers no 
difficulty, but with an even number the use of the value halfway
between the two middle scores is a convention supplementary
to the definition.


10. CUMULATIVE FREQUENCY DISTRIBUTIONS


It is often useful to have the data arranged in a cumulative 
rather than a simple frequency distribution. This is accomplished
by tabulating all of the frequencies less than the upper
limit of each class interval. For the Otis material the cumu- 
lative distribution would be as follows: 


TABLE 6. CUMULATIVE FREQUENCY DISTRIBUTION FOR THE
OTIS TEST DATA


SCORE LESS THAN          CUMULATIVE FREQUENCY

189.5                            50
179.5                            49
169.5                            48
159.5                            44
149.5                            33
139.5                            24
129.5                            13
119.5                             8
109.5                             4
 99.5                             2
 89.5                             1


The cumulative frequencies are of course easily tabulated after 
the simple frequency distribution has been made. Both meth- 
ods of representing series will be extensively used in applying 
the descriptive methods of the following chapters. 
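
The step from the simple to the cumulative distribution is a running sum, as the following Python sketch suggests. The class frequencies are those implied by Table 6, listed from the lowest class upward; the variable names are assumptions.

```python
from itertools import accumulate

# Upper class limits 89.5, 99.5, ..., 189.5 (classes ten units wide).
upper_limits = [89.5 + 10 * i for i in range(11)]

# Simple frequencies for the Otis scores, lowest class first.
frequencies = [1, 1, 2, 4, 5, 11, 9, 11, 4, 1, 1]

# Each cumulative frequency is the number of scores below that upper limit.
cumulative = list(accumulate(frequencies))

for limit, total in zip(reversed(upper_limits), reversed(cumulative)):
    print(f"{limit:6.1f}   {total:3d}")
```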


EXERCISES 


1. Make a classifier for the fifty Terman scores in the table on
page 29 and obtain the ranks of the scores. Determine the median
from these ranks. (128 or 122.7. Ans.)

2. Work out a scheme for ranking the Chicago scores and obtain
the median. (53.25. Ans.)


3. Make a simple frequency distribution for the Terman scores
from the classifier. The classes will be 169.5-179.5, 159.5-169.5,
etc. Make a similar distribution for the Chicago scores with classes
74.75-79.75, 69.75-74.75, etc.


SCORES OF FIFTY PUPILS ON THREE INTELLIGENCE TESTS


                    TEST                                   TEST
PUPIL     Otis   Chicago   Terman        PUPIL     Otis   Chicago   Terman

  1       171     [?]       [?]            26      133     47        101
  2       169     75.5      153            27      151     53.5      137
  3       128     50.5      131            28      145     56.5      119
  4       141     46        105            29      152     56        124
  5       106     39.5       71            30      157     66.5      170
  6       146     55        130            31      144     60.5      155
  7        87     34         80            32      140     60.5      119
  8       114     42        101            33      111     38.5      142
  9       187     70        153            34      150     63.5      140
 10       133     51.5      132            35      152     65.5      [?]
 11       151     59        136            36      137     48        115
 12       131     52.5      128            37      146     54        125
 13       150     63        145            38      128     44.5       87
 14       118     44.5      110            39      145     57.5      120
 15       142     65.5      122            40      153     50.5      [?]
 16       166     61        152            41      149     53        135
 17       158     55        157            42      114     45.5      100
 18       101     39         88            43      135     40        125
 19       159     57.5      156            44      131     47        120
 20       126     41.5       92            45      161     61        149
 21       136     65.5      115            46       95     37         87
 22       137     63.5      109            47      134     50        103
 23       152     75.5      151            48      124     48.5      119
 24       137     45        132            49      [?]     43         95
 25       132     61.5      130            50      167     58.5      178

[?] Entry illegible in this copy.


4. Arrange separately the Terman and Chicago scores in the form 
of cumulative frequency distributions. 


5. Make a frequency distribution for the following scores, using an 
interval of one unit: 11, 12, 12, 13, 13, 13, 14, 14, 14, 14, 15, 15,
15, 15, 15, 16, 16, 16, 16, 17, 17, 17, 18, 18, 19. Calculate the average
(mean). What will the average be if the intervals are taken 11-11.99,
etc., instead of 10.5-11.5, etc.? What is the error in the average by
the former tabulation method? (Error is .5.) 


6. Tabulate separately the scores on page 30 on spelling tests A
and B for 125 pupils, using an interval of 5. 

7. Retabulate the scores in Exercise 6, using an interval of 10. 
Which interval is better? 

8. Make cumulative frequency distributions from the two spelling 
test distributions of Exercises 6 and 7.


SCORES OF 125 PUPILS ON TWO SPELLING TESTS OF EQUAL DIFFICULTY
(Maximum Score = 105)


CHAPTER III

TABULAR AND GRAPHICAL PRESENTATION OF DATA


1. PURPOSE OF TABLES AND DIAGRAMS


Although the preparation of tables and diagrams will usually 
be the last step in working out a statistical problem, it is well 
to consider such work at this point because of its relative sim- 
plicity and concreteness. For many elementary studies, more- 
over, such as school and publicity reports, the tabulation and 
graphical representation of secondary material is about the 
only statistical method required. It is, therefore, desirable that 
everyone dealing with educational statistics should become 
acquainted as soon as possible with simple tables and graphs. 

In the following discussion the word “diagram” is used to
describe all sorts of graphs, charts, plots, or maps used for the 
display or comparison of data. 

Tables and diagrams have a twofold purpose: one is to assist 
in the analysis of the material and simplify the calculations by 
representing the data in concise and orderly fashion, while the 
other is to summarize and make clear the findings of a study. 
Thus the chief reason for arranging material in a frequency 
table is to facilitate analysis and calculation. The important 
characteristics of the series may then be readily determined and 
the required calculations made more easily than from the un- 
grouped data. On the graphical side a method of calculation 
has been developed known as nomography. By means of 
curves drawn to suitable scales a great many statistical calcu- 
lations may be made very quickly. In many cases, however, 
the construction of the nomograph is very laborious and the 
desired calculations will not be given to a sufficient number of
significant figures. With the modern development of calculating
machines and statistical tables for computation, almost
every sort of calculation will be found to be easier, more rapid, 
and much more accurate by numerical rather than by graphical 
methods. 

The proper use of tables to summarize numerical results is 
important because the success of a statistical study may depend 
a great deal upon the skill with which the tabular material is 
arranged. Good tables are usually brief and so titled as to be 
self-explanatory. By a suitable arrangement of headings a large 
amount of important information can be given in a very short 
space, comparison between similar items facilitated, and visuali- 
zation of group relationships made possible. 

Graphs or diagrams for presentation are intended to make 
the numerical comparisons clearer and more vivid. They are 
not primarily intended to summarize the statistical findings, 
which should appear in tabular form accompanying the diagram. 
If too many details are given in a chart its clarifying value is 
lost and a diagram that is not clear is probably not worth 
making at all. 


2. THE CONSTRUCTION OF TABLES FOR PRESENTATION 


While there is not universal agreement as to the terms used 
and the best form for a table, the following suggestions have 
the merit of successful usage in the publications of the Russell 
Sage Foundation. 


DEFINITIONS OF THE PARTS OF A STATISTICAL TABLE 


1. A statistical table is a quantitative presentation of facts by 
means of numbers arranged in a column or columns and distributed 
according to one or more groupings of the subject matter. 

2. A table title is a statement appearing at the head of a statistical
table, showing the subject with which the table deals. 

3. A column in a statistical table is the series of numbers, gener- 
ally relating to the same unit, arranged vertically in the table. 

4. A line in a statistical table is a series of numbers arranged in 
a horizontal row in the table.

5. The body of a table is the aggregate of the columns and the lines.

6. A column heading is a word or group of words at the head of a 
column of numbers in a table, showing the unit dealt with and the 
relation of the column to the classification followed. 

7. A brace heading, or box heading, is a word or group of words 
appearing above two or more columns of numbers in a table, which 
it has the effect of uniting as with a brace, and to each of which it 
bears the same relation. In connection with the column headings, 
the brace heading shows the unit dealt with and the relation of 
each column to the plan of classification followed. 

8. A line title is a word or group of words at the left of a horizontal 
line or row of figures in a table, showing the relation of the line to 
the plan of classification followed. 

9. A total is a statement of the aggregate of two or more numbers 
appearing in a column or line. 

10. A grand total is a statement of the aggregate of several totals. 


TABLE 7. EXPENDITURE PER INHABITANT FOR OPERATION AND MAINTENANCE
OF SCHOOLS IN CLEVELAND, AND IN SEVENTEEN OTHER CITIES OF
FROM 250,000 TO 750,000 INHABITANTS, 1914


                   ESTIMATED      EXPENDITURE FOR OPERATION     RANK IN
                   POPULATION     AND MAINTENANCE               EXPENDITURE
CITY               IN 1914 (IN    Total (in      Per            PER
                   THOUSANDS)     thousands)     inhabitant     INHABITANT

Baltimore              580          $1955          $3.37           17
Boston                 734           5517           7.52            2
Buffalo                454           2450           5.40           12
Cleveland              639           3570           5.59            8.5
Detroit                538           2553           4.75           14
Indianapolis           259           1410           5.44           11
Jersey City            294           1421           4.83           13
Kansas City            282           1761           6.24            7
Los Angeles            439           3707           8.44            1
Milwaukee              417           1795           4.30           15
Minneapolis            343           2148           6.26            6
Newark                 389           2699           6.94            3
New Orleans            361           1098           3.04           18
Pittsburgh             565           3602           6.38            5
San Francisco          449           1879           4.18           16
Seattle                313           1751           5.59            8.5
St. Louis              735           4085           5.56           10
Washington             353           2392           6.78            4

Average                 --             --          $5.59           --


The model table on page 33 illustrates all of the terms used
with the exception of the totals, which were not necessary. It 
will be noted that the basic information upon which the com- 
parisons are made is given in the table so that it could be veri- 
fied by the reader in case of doubt. The style notes used in the 
construction of the table are given in the following list: 


STYLE NOTES FOR MAKING TABLES 
I. ARRANGEMENT OF DATA


1. A short table is clearer and more forceful than a long one. 

2. Original data should be presented in full. 

3. It is easier to compare numbers arranged one above the other
than numbers placed side by side. Tables should be arranged so that, 
as far as possible, numbers to be compared are in the same column. 

4. Items listed in a table should usually be arranged in descending 
or in ascending order of their rank in the trait in which they are 
being compared. 


II. TITLES AND HEADINGS 


1. The titles should always go above a table since a table is essen- 
tially a list. 

2. Titles and headings should be so worded and the table so ar- 
ranged that the result will be a complete whole, independent of the 
accompanying text. 

3. Table titles should place emphasis upon the fact or facts which 
the table is intended to show. This can be accomplished by placing 
the important facts at the beginning of the title. 

4. Words like “table showing,” “number of,” and “distribution 
of” should be omitted wherever the meaning of the title is clear 
without them. 


III. PUNCTUATION 


1. Table titles should be capitalized throughout. 

2. In column headings and in line titles, except in the case of 
proper names, only the initial letter of the first word should be 
capitalized (also good form to capitalize first letter of each word). 

3. Do not end a title with a period. If the title consists of two 
sentences, put a period after the first sentence. 
4. Do not use periods in column headings or in line titles except 
for abbreviations and to separate sentences as above. Avoid abbre- 
viations when possible. 

5. Do not use periods in the body of a table except to separate 
dollars from cents or units from tenths. 

6. Where one line of items is to be compared with those in the 
rest of the table this line may be in heavier type, so that it may be 
more readily seen. 


IV. SYMBOLS 


1. Ditto marks should not be used either in the body of a table or 
in its headings and titles. 

2. Where sums of money are stated in columns, the dollar sign 
should be placed before the first item in the list and before the total 
or average. 

3. Footnotes to the table should be indicated by letters and not 
by figures (also good form to use symbols such as *, §, etc.). 

4. Where data are not available do not fill in the space in the 
table with 0’s. Reserve 0 for the definite information that it gives, 
that is, nothing; use ------ or ...... to show that no figures 
are at hand. 

5. A row of dots or dashes on the lower part of the line may be 
used in the first column to guide the eye from each item to its cor- 
responding figure. These dots should not extend beyond the first 
vertical rule. 


V. “TOTAL” AND “PER CENT” 


1. “Total” should always be written in the singular. 
2. “Per cent” should be written in two words, with no period. 


VI. RULING 


1. There should be a double rule at the top of the table. 

2. A single horizontal rule should separate column headings from 
the body of the table. 

3. At the bottom of the table there should be a double horizontal 
rule. 

4. Totals and averages should be separated from the numbers of 
which they are the aggregates, by single heavy rulings (single light 
ruling is also good form). 

5. There should be vertical rules between the line titles and the 
figures, and between each two columns of figures. 
6. Tables should not be closed in at the sides by vertical rules. 

7. Each column heading should be boxed in except at the two 
outer sides. 

8. These rules may be summarized as follows: There are three 
kinds of lines used in ruling a table: double lines at the top and bot- 
tom of the table; single lines between column headings and figures, 
and between columns; and heavy lines before totals and averages. 


VII. SPACING 


1. In long tables it is well to leave a double space after each five 
or ten lines of figures, to facilitate the reading. 

2. Numbers should be placed in the middle of the column with 
corresponding units directly under each other. 


3. COLUMN AND BAR DIAGRAMS 


For a full account of the great variety of diagrams which may 
be used the reader is referred to such texts as Williams’s, listed 
with the selected texts in the bibliography. The discussion here 
will be confined to a few simple types which serve most of the 
purposes in an ordinary statistical study and which can be made 
without much training or great outlay of drawing materials. 
If elaborate figures are required it is probably better to have 
them drawn by an artist from a rough sketch rather than spend 
time in acquiring the skill necessary to use a drawing board and 
instruments. For the great majority of articles, books, and 
theses, however, only the simplest types of diagrams are neces- 
sary, and these may be made on ruled paper in black ink with 
very little practice. 

The column diagram consists of a series of columns proportional 
in height to the quantities represented. A scale usually 
appears at the left and a legend either on the background near 
the columns or below as in Fig. 6. In this figure two varying 
quantities are shown very effectively on the same chart, the 
hatched portion representing the undesirable condition. Such 
a diagram may be made with india ink on ruled graph paper, 
blue lines being preferred, because these will be invisible if the 
chart is photographed. 

Fig. 6. Showing the holding power of the schools 
The columns represent the children enumerated by the school census 
as of each age from six through twenty. Portion in outline represents 
children in public schools. Portion in black represents those not in 
public schools. (Cleveland Education Survey Report, 1916) 

Fig. 7 is another ingenious variation of the column diagram. 
Each block represents a school identified by number so that it 
is possible to compare any school with another or with the whole 
group. This type of diagram can be effectively used to represent 
a group of test scores in such a way that each pupil can recognize 
his score by number without revealing this fact to the rest of 
the class. 

Fig. 7. Average scores made in spelling by ninety-six elementary schools* 
The figures below the diagram show the percentages (63 to 86), and 
the ones in the diagram show the number of the schools 

In case the columns are used to represent the frequencies of the 
various classes along the horizontal scale the resulting diagram 
is known as a histogram. The columns are then proportional in 
height and area to the frequencies, and this property makes the 
histogram an excellent representation of a frequency distribution. 
The histogram for the Otis scores in Table 4 is given in Fig. 8. 
It will be noted that the horizontal scale is given in even 
integers and the column moved slightly to the left so as to have 
the intervals 79.5-89.5 etc. 

Fig. 8. Histogram of Otis Test Scores (horizontal scale: Otis 
score, 80 to 190) 

An alternative representation of the frequency distribution is 
given by the frequency polygon. This consists of lines connecting 
the frequencies taken at the midpoints of the class intervals. 
In Fig. 9 a histogram and frequency polygon are plotted on the 
same background for comparison. It will be noted that the area 
under the histogram between I.Q.’s from 90 to 100 is exactly 
proportional to the observed frequency over that range. The area 
under the polygon over this same range is somewhat in defect, 
however, and similar discrepancies occur for the other intervals 
unless three points on the polygon happen to be on a line. For 
these reasons the histogram is a better representation of the 
frequency distribution when a curve is to be fitted to the data. 
In case a rough diagram is required to show the overlapping of 
several distributions, the polygon is probably better, but for 
most other representations the histogram is preferable. 
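
The comparison of areas can be checked numerically. The sketch below, in Python, computes the area over one class interval under the histogram and under the frequency polygon; the frequencies and the class width of 10 are hypothetical values chosen only for illustration, and the function names are not from the text.

```python
def histogram_area(freqs, i, width):
    """Area of the histogram column over class interval i."""
    return freqs[i] * width


def polygon_area(freqs, i, width):
    """Area under the frequency polygon over interior class interval i.

    The polygon joins the points (midpoint, frequency), so at each class
    limit its height is the mean of the two neighbouring frequencies.
    """
    left = (freqs[i - 1] + freqs[i]) / 2   # height at the lower class limit
    right = (freqs[i] + freqs[i + 1]) / 2  # height at the upper class limit
    mid = freqs[i]                         # height at the class midpoint
    # Two trapezoids, each half a class interval wide.
    return (width / 2) * (left + mid) / 2 + (width / 2) * (mid + right) / 2


peaked = [2, 5, 9, 5, 2]    # hypothetical frequencies with a peak
straight = [2, 4, 6]        # three frequencies lying on one line

print(histogram_area(peaked, 2, 10), polygon_area(peaked, 2, 10))
print(histogram_area(straight, 1, 10), polygon_area(straight, 1, 10))
```

For the peaked series the polygon area falls short of the histogram area; for the three collinear frequencies the two areas agree, which is the exception noted in the text.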

* From C. H. Judd, “Measuring the Work of the Public Schools,” Cleveland 
Education Survey Report, 1916, p. 84. 

Fig. 9. Histogram and frequency polygon for the intelligence quotients 
of Table 20, Chapter VII 

In bar diagrams the varying quantities are represented by 
horizontal bars as in Fig. 10. The chief reason for preferring a 
bar to a column arrangement is one of convenience. If the line 
titles are large as in the accompanying figure, the use of columns 
would be awkward. For a fairly large number of items the bar 
diagram will also be found to be more effective. 

Quantities which exhibit variation in one dimension should be 
represented by column or bar diagrams which are themselves 
one-dimensional. The use of three-dimensional diagrams, such as 
a row of persons of varying size to show increase in population, 
may be very misleading because there is doubt as to whether the 
height, area, or volume of the figures is proportional to the 
change in population. 

Fig. 10. Expenditure per child in average daily attendance for operation and 
maintenance of public schools, for Cleveland and for seventeen other cities * 
(horizontal scale: $0 to $60) 

Los Angeles      $64.78 
Seattle           61.18 
Pittsburgh        58.97 
Boston            56.78 
Kansas City       52.96 
Minneapolis       52.70 
St. Louis         52.40 
Washington        51.34 
Buffalo           51.32 
Newark            50.25 
Indianapolis      46.59 
Cleveland         46.38 
San Francisco     45.08 
Detroit           44.66 
Jersey City       43.17 
Milwaukee         38.51 
New Orleans       33.07 
Baltimore         32.54 


* From Earl Clarke, “Financing the Public Schools,” Cleveland Education Survey 
Report, 1916, p. 37. 
4. COORDINATES 


In order to remind the reader of his coordinate geometry, 
which will be very much needed in the work which is to follow, 
the next few paragraphs will be devoted to a summary of the 
elements of that subject. 

If two straight lines OX and OY intersect in a plane, it is 
possible to describe the location of any point P in the plane 
with respect to the point of intersection O. For most 
representations it is convenient to have the lines intersect at 
right angles. The horizontal line OX is known as the x-axis, or 
axis of abscissas, while the vertical line OY is called the 
y-axis, or axis of ordinates. The distances of the point P from 
the two axes are known as coordinates of the point. Thus in 
Fig. 11 the abscissa of the point P is OM, or four units, while 
its ordinate is ON, or three units. These two coordinates will 
locate uniquely the position of any point P with respect to the 
origin O. 

Fig. 11. Illustrating ordinate and abscissa 

It will be noted that in Fig. 11 only positive quantities can be 
represented. In case negative numbers occur, the coordinate 
system may be extended as shown in Fig. 12. The plane is thus 
divided into four quadrants numbered in counterclockwise 
direction about O. The coordinates of a point in the second and 
fourth quadrants are opposite in sign, while those for a point 
in the third quadrant are both negative. The coordinates of 
the four points in the diagram are as follows: 


POINT      ABSCISSA, X      ORDINATE, Y 
P1             +4               +2 
P2             -5               +3 
P3             -3               -2 

P4             +7               -1 
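
The sign rule for the four quadrants can be put as a short Python sketch. The function name `quadrant` and the convention of returning 0 for a point on an axis are assumptions made for this illustration; the last point tested is a hypothetical fourth-quadrant point.

```python
def quadrant(x, y):
    """Return the quadrant (1 to 4) of the point (x, y), or 0 if it lies on an axis."""
    if x == 0 or y == 0:
        return 0
    if x > 0:
        return 1 if y > 0 else 4   # positive abscissa: first or fourth
    return 2 if y > 0 else 3       # negative abscissa: second or third


# Quadrants are numbered counterclockwise about the origin O.
for point in [(4, 2), (-5, 3), (-3, -2), (7, -1)]:
    print(point, quadrant(*point))
```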
In plotting mathematical relationships it is usually necessary 
to employ the more extended scheme with four quadrants, but in 
dealing with statistical data which are usually positive, the 
first quadrant will suffice. 

Fig. 12. Illustrating plotting in four quadrants 

The following table gives the lung capacity in cubic inches of 
521 boys in the laboratory schools of The University of Chicago. 
The ages of the boys ranged from five to nineteen years, the 
measurements being made within a few days of each birthday. 


TABLE 8. LUNG-CAPACITY DATA FROM THE LABORATORY SCHOOLS 


AGE AVERAGE LUNG CAPACITY 
5 76 
6 73 
7 88 
8 95 
9 106 

10 122 
11 129 
12 148 
13 165 
14 184 
15 211 
16 230 
17 252 
18 264 
19 287 


These data have been plotted in Fig. 13 and the points connected 
in the form of a polygon. The trend appears fairly straight with 
the exception of a general dip during the years of adolescence. 
This dip has been verified by other material.* 

Such a plot as that shown below is of value in analyzing the 
data and in giving to the reader a clear idea of the relation 
between the variables involved. We shall turn next to the 
general consideration of such functional relationships. 


5. FUNCTIONAL RELATIONSHIPS 


When two variables are so related that the value of the first 
variable depends upon the value of the second variable, then the 
first variable is said to be a function of the second. The area 
of a square, for example, is a function of the length of the 
side; that is, area equals (side)$^2$, or $y = x^2$. Here the 
relationship is exact, all true squares conforming precisely to 
the law. Such functional relationships may be called 
mathematical, and are generally written in the form $y = f(x)$. 

Fig. 13. A plot of lung-capacity data (horizontal scale: age in 
years) 

The second variable, to which values may be assigned at 
pleasure, is called the independent variable, or argument; and the 
first variable, whose values are determined as soon as values of 
the argument are assigned, is called the dependent variable, or the 
function. In the above example x, representing the side of the 
square, is the independent variable, while y, representing 
the area of the square, is the dependent variable. 

Other examples of functions are breathing capacity, which 
is a function of the age of the person; the number of words 
typed per minute, which is a function of the hours of practice; 
and the score on an achievement test, which is a function of the 
time spent in studying the subject tested. Such functions differ 
from exact mathematical functions in that they depend upon 
many more variables than the ones given, and the relationships 
indicated are only approximate. Breathing capacity, for 
instance, depends upon a great many factors other than age. A 
curve or an equation expressing the most probable breathing 
capacity for given ages will then furnish a basis for rough 
estimation rather than exact prediction. An important part of 
statistical method is concerned with the selection of those 
mathematical functions which will give the best “fit” for a 
given body of data (see Chapter XVI). 


* Karl J. Holzinger, “On the Relation of Vital Capacity to Certain Psychical 
Characters,” Biometrika, Vol. XVI, p. 139. 


6. THE STRAIGHT LINE 


One of the simplest mathematical functions is that wherein 
the change in y is directly proportional to the change in x, 
for example, $y = 3x$ or $y = \tfrac{1}{2}x$. The graphs of these 
functions will be straight lines through the origin, as shown in 
Fig. 14. In obtaining the coordinates of various points it is 
only necessary to substitute arbitrary values for the argument 
x, and find the corresponding values of y. While only two points 
are necessary to determine a straight line, one other value has 
been given as a check. 

    $y = 3x$              $y = \tfrac{1}{2}x$ 
    y | x                 y | x 
    3 | 1                 1 | 2 
    6 | 2                 2 | 4 
    9 | 3                 5 | 10 

Fig. 14. Graphs of the lines $y = 3x$ and $y = \tfrac{1}{2}x$ 

Fig. 15. Graph of $y = 2x + 3$ 
The general equation of a straight line may be written in the 
form $y = ax + b$, where b is a constant representing the distance 
from the origin to the point of intersection of the given line 
and the y-axis (y-intercept), and a is a constant representing 
the slope of the line (the tangent of the angle which the line 
makes with the x-axis). The line $y = 2x + 3$ is shown in Fig. 15. 
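
The substitution of values for the argument, and the reading of the slope as a tangent, can be sketched in Python as follows. The function names `line_points` and `slope_angle_degrees` are illustrative assumptions and do not come from the text.

```python
import math


def line_points(a, b, xs):
    """Return the points (x, y) on the line y = a*x + b for the given x values."""
    return [(x, a * x + b) for x in xs]


def slope_angle_degrees(a):
    """Angle, in degrees, which a line of slope a makes with the x-axis."""
    return math.degrees(math.atan(a))


print(line_points(3, 0, [1, 2, 3]))      # three points on y = 3x
print(line_points(0.5, 0, [2, 4, 10]))   # three points on y = x/2
print(line_points(2, 3, [0, 1, 2]))      # y = 2x + 3; the first point is the y-intercept
print(slope_angle_degrees(2))            # inclination of y = 2x + 3
```

Two of the points fix each line; the third serves as the check mentioned in the text.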


7. NON-LINEAR RELATIONSHIPS 


The term “curve” is employed in mathematics to designate any 
line, straight or curved, when located with reference to some 
coordinate system. It has been noted that equations of the first 
degree in x furnish straight lines when graphed. In case higher 
powers of the argument are present, some other form of curve 
results. One of the simplest of these is the parabola, the 
general equation of which is $y = ax^2 + bx + c$, where the 
letters a, b, and c again represent constants which determine 
the particular curve. The parabola $y = x^2 - 3x + 2$ is shown 
in Fig. 16. Here positive and negative values of the argument 
were substituted in the equation of the parabola to find the 
corresponding values for y. 

Fig. 16. Graph of $y = x^2 - 3x + 2$ 

The normal probability curve, which will be used a great deal 
in the subsequent work, may be written in the form 
$y = e^{-x^2/2}$, where e is the base of the Napierian system of 
logarithms and is equal to 2.71828. The curve in Fig. 17 has 
been plotted from the series of values furnished at the right. 
These values could be calculated directly, but are readily 
obtained from tables already prepared. It is evident that the 
same positive and negative values of the argument give only one 
value for the function, so that the curve is symmetrical about 
the y-axis. It is also to be noted that the vertical scale unit 
was not taken equal to the horizontal one. The choice of scale 
units will of course in no way alter the properties of the curve 
and is largely a matter of taste unless the curve is to be 
“fitted” to a series of observations. (See Chapter XII, 
section 5.) 

Fig. 17. Graph of $y = e^{-x^2/2}$ (horizontal scale from -2.8 
to 2.8) 
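
Values of both curves may be computed directly rather than taken from tables. The Python sketch below tabulates the parabola and the normal curve at a few arguments; the function names and the particular arguments chosen are assumptions for illustration only.

```python
import math


def parabola(x):
    """The parabola y = x^2 - 3x + 2 of Fig. 16."""
    return x * x - 3 * x + 2


def normal_curve(x):
    """The curve y = e^(-x^2 / 2) of Fig. 17."""
    return math.exp(-x * x / 2)


for x in [-2, -1, 0, 1, 2]:
    print(x, parabola(x), round(normal_curve(x), 4))
```

Equal positive and negative arguments give the same ordinate on the normal curve, which is the symmetry about the y-axis noted above; the parabola shows no such symmetry about that axis.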


EXERCISES 


1. Calculate the valuation per inhabitant from the following data, 
computing the per capita valuations to the nearest dollar. Make 
a table ruled up according to the specifications in section 2. The 
columns in the table will be (1) city, (2) population, (3) total valua- 
tion, (4) valuation per inhabitant, (5) rank. 


                                        ESTIMATED VALUATION 
                    POPULATION IN 1914  OF ALL PROPERTY 
CITY                (THOUSANDS)         ASSESSED 
                                        (THOUSANDS) 

Baltimore . . . . .       580             $723,800 
Boston  . . . . . .       734            1,489,609 
Buffalo . . . . . .       454              494,200 
Cleveland . . . . .       639              756,831 
Detroit . . . . . .       538              598,634 
Indianapolis  . . .       259              363,414 
Jersey City . . . .       294              257,645 
Kansas City . . . .       282              371,191 
Los Angeles . . . .       439              836,604 
Milwaukee . . . . .       417              511,721 
Minneapolis . . . .       343              639,259 
Newark  . . . . . .       389              383,864 
New Orleans . . . .       361              314,086 
Pittsburgh  . . . .       565              789,035 
San Francisco . . .       449            1,247,391 
Seattle . . . . . .       313              473,175 
St. Louis . . . . .       735            1,125,309 
Washington  . . . .       353              538,390 
2. Make a column diagram for the following data on centimeter 
or similar graph paper: 

a. Each column represents a grade and is proportional in height 
to the membership of the grade. 

b. Darken the upper part of each column to show the proportion 
of overage children in each grade. 

c. Make the columns one centimeter wide and leave a one-half- 
centimeter space between columns. 

d. Print the total membership over each column or use a scale at 
the left. 

e. Put a suitable title at the bottom of the diagram. 


NUMBER OF NORMAL AND OVERAGE PUPILS IN AN IDEAL SCHOOL IN 
WHICH AN 80 PER CENT PROMOTION RATE IS IN EFFECT 


GRADE         TOTAL    NORMAL    OVERAGE 
I . . . . .    125       120         5 
II  . . . .    125       112        13 
III . . . .    125       103        22 
IV  . . . .    125        92        33 
V . . . . .    124        82        42 
VI  . . . .    121        73        48 
VII . . . .    109        63        46 
VIII  . . .     85        55        30 


3. Plot the following pairs of scores for quality and speed on the 
Ayres Handwriting Scale. 

Q. 42, 31, 65, 59, 38, 62, 35, 47, 5%, 67, 51, 42, 34, 29, 63 

S; .94,.91;,87, 81,-80,.78, 15,74; 75, 70, 10.66, Glaaae tn 


4. Make histograms for the distributions found in Exercise 3 of 
Chapter IT. 


5. Make histograms for the two spelling distributions of Exer- 
cises 6 and 7 of Chapter II. 


6. Construct graphs for the cumulative frequency distributions 
given by Exercises 4 and 8 of Chapter II. 


7. Plot the straight lines, 

(a) $y = 3x - 7$, (b) $y = 2x + 6$, (c) $x = 3y - 4$. 

8. Plot the curves, 
(a) $y = 3x^2 + 2x - 1$, (b) $y = 4x^3 - 6x^2 + 2x + 8$, (c) $y = 10e^{-x^2/2}$. 
(Make use of the values given in section 7.) 


CHAPTER IV 
LOGARITHMS 
1. INTRODUCTORY 


For most computations it is best to use a calculating machine, 
but for students such aids are frequently out of the question. 
They must often resort to ordinary arithmetic, slide rules, or 
logarithms in working out statistical problems. In dealing with 
classroom exercises and even extended problems such as those 
arising in connection with a thesis, logarithms will be found to 
be extremely convenient and accurate. The present chapter is 
therefore devoted to a brief account of their nature and use. 

The student who is familiar with logarithms may omit this 
chapter, but it frequently happens that one needs to review this 
subject. The present material may then serve not only as a 
short introduction for the student who knows nothing of loga- 
rithms, but also as a convenient reminder of some of the things 
once known but forgotten. 


2. ARITHMETICAL AND GEOMETRICAL PROGRESSIONS 


An arithmetical progression is a succession of terms such that 
each term differs from that immediately preceding it by a con- 
stant known as the common difference. Show that the following 
are examples of such arithmetical progressions or series : 


ARITHMETICAL PROGRESSION                    DIFFERENCE, d 
a. 1, 2, 3, 4, 5, 6, 7, ...                     +1 
b. 16, 14, 12, 10, 8, ...                       -2 
c. 6, 11, 16, 21, 26, 31, 36, ...               +5 
d. 2½, 3¾, 5, 6¼, 7½, 8¾, ...                   +1¼ 
e. -5, -3, -1, +1, ...                          +2 
f. a, a + d, a + 2d, ...                        +d 
A geometrical progression is a series of terms such that each 
term is the product of the preceding term by a constant known 
as the ratio. Examples of such progressions are as follows: 


GEOMETRICAL PROGRESSION                     RATIO, r 
Cs ead ety’ eee Das Pom A RE PAC OOn eA te ks ty Py be Ka een GS es 
Oy 71005505255 "1225 v6.20; och ic tome tome nee —5 
ee Par ee are POCO xe aie Opie ou oh adr Oh th Bh ge = 3 
Gh Gy Chey GH, Gi Wis, O25 5 6 8 5 4 a os pick ee ake r 


The abbreviation for an arithmetical progression is A.P. 
and for a geometrical progression is G. P. 

The arithmetical mean of a series is obtained by dividing the 
total of the numbers by the number of items in the series. Thus 
in the series 1, 2, 3, 4, 5, 6, 7, the mean is 28/7 = 4. A general 
procedure for finding the mean of any arithmetical series may 
be shown as follows: Let the first term and the difference be 
any algebraic numbers denoted by a and d and let n be a posi- 
tive integer representing the number of terms. We may then 
write 

Number of term:   1    2        3         ...   n 
Progression:      a    a + d    a + 2d    ...   a + (n - 1)d 


The last term, or l, is clearly given by the formula 

    $l = a + (n - 1)d.$    (1) 


If s denotes the sum of the n terms in such a progression, this 
sum written in natural and in reverse order will give 

        $s = a + [a + d] + [a + 2d] + \cdots + [a + (n - 1)d]$ 
and     $s = l + [l - d] + [l - 2d] + \cdots + [l - (n - 1)d].$ 


Adding these two equations, member by member, we find that 

    $s = \frac{n(a + l)}{2}.$    (2) 

The arithmetical mean, A.M., is therefore given by 

    A.M. $= \frac{s}{n} = \frac{a + l}{2}.$    (3) 

Applying this formula to the A.P. 6, 11, 16, 21, 26, 31, we 
obtain A.M. $= \frac{6 + 31}{2} = 18.5$. 
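
Formulas (1) to (3) may be verified against a direct summation. The following Python sketch does so for the progression just used; the function names are assumptions made for the illustration.

```python
def ap_terms(a, d, n):
    """The n terms of the A.P. with first term a and common difference d."""
    return [a + k * d for k in range(n)]


def ap_last(a, d, n):
    """Formula (1): l = a + (n - 1)d."""
    return a + (n - 1) * d


def ap_sum(a, d, n):
    """Formula (2): s = n(a + l) / 2."""
    return n * (a + ap_last(a, d, n)) / 2


def ap_mean(a, d, n):
    """Formula (3): A.M. = (a + l) / 2."""
    return (a + ap_last(a, d, n)) / 2


terms = ap_terms(6, 5, 6)
print(terms, sum(terms), ap_sum(6, 5, 6), ap_mean(6, 5, 6))
```

The sum obtained by adding the six terms one by one agrees with formula (2), and the mean agrees with the value 18.5 found above.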


If three numbers form a G.P. the middle number is called 
the geometrical mean of the other two and is obtained by 
extracting the square root of their product. This follows at 
once from the general form of a G.P.: 
$a, ar, ar^2, \ldots, ar^{n-1} = l$. For any two numbers a and 
b, therefore, the geometrical mean is given by the formula 

    G.M. $= \sqrt{ab}.$    (4) 


EXAMPLE. The G.M. of 1 and 9 is $\sqrt{1 \times 9} = 3$, that 
is, 1, 3, 9 are in a G.P. with ratio 3. 

Insert four geometric terms between 18 and $\tfrac{2}{27}$. The 
first term a = 18, $l = \tfrac{2}{27}$, and n = 4 + 2 = 6. Since 
$l = ar^{n-1}$ we have $r^5 = \frac{l}{a} = \frac{1}{243}$, 
whence $r = \tfrac{1}{3}$. The required terms are therefore 
6, 2, $\tfrac{2}{3}$, and $\tfrac{2}{9}$. 
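
The two examples above may be reproduced in Python. This is a sketch only: it assumes positive end terms and a positive ratio, works in floating-point arithmetic, and uses function names that are not from the text.

```python
import math


def geometric_mean(a, b):
    """Formula (4): the geometrical mean of two positive numbers."""
    return math.sqrt(a * b)


def geometric_terms_between(first, last, k):
    """Insert k terms between first and last so that all k + 2 terms form a G.P."""
    n = k + 2
    r = (last / first) ** (1 / (n - 1))   # from l = a * r**(n - 1)
    return [first * r ** i for i in range(1, n - 1)]


print(geometric_mean(1, 9))
print(geometric_terms_between(18, 2 / 27, 4))
```

The second line of output agrees, apart from rounding in the last decimal places, with the terms 6, 2, 2/3, and 2/9.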


3. THE INVENTION OF LOGARITHMS 


The most important discovery in the development of mathe- 
matical computation was the invention of logarithms by John 
Napier, Baron of Merchiston of Scotland (1550-1617). The 
principle underlying his invention may be explained in terms 
of arithmetical and geometrical progressions. 

Let such a pair of associated series be given as follows: 


A.P.  0  1  2  3   4   5   6    7    8    9    10 
G.P.  1  2  4  8  16  32  64  128  256  512  1024 


The product of any two numbers in the second line of numbers 
(G.P.) may be found by adding the corresponding numbers in 
the A.P., finding this sum in the A.P., and finally taking the 
corresponding number in the G. P. line as the required answer. 
Thus the product 4 x 128 may be found by adding 2 and 7 (the 
numbers in the A.P. corresponding), finding their sum (9) in 
the A. P., and then the corresponding number (512) in the G. P., 
this being the required product. The time-saving principle 
illustrated by this method is that the process of multiplication 
is replaced by that of addition. 

It is apparent that series such as the above furnish only a few 
of the possible products which might be required. The system 
needs, therefore, to be extended. In addition to continuing the 
progressions at either end, Napier inserted terms as illustrated 
by the following series : 


A.P.   0    .5           1    1.5          2    2.5           3    3.5 

G.P.   1    $\sqrt{2}$   2    $\sqrt{8}$   4    $\sqrt{32}$   8    $\sqrt{128}$ 
  or   1    1.41         2    2.83         4    5.66          8    11.31 

This amounts to inserting arithmetical and geometrical means 
between the original terms. 
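
The principle of replacing multiplication by addition may be illustrated with the first pair of series. The Python sketch below is an assumption-laden illustration: the names `AP`, `GP`, and `multiply_by_addition` are hypothetical, and the method works only when both factors and their product appear in the G.P. line.

```python
import math

AP = list(range(11))          # 0, 1, 2, ..., 10
GP = [2 ** k for k in AP]     # 1, 2, 4, ..., 1024


def multiply_by_addition(u, v):
    """Multiply two entries of GP by adding the corresponding entries of AP."""
    total = AP[GP.index(u)] + AP[GP.index(v)]
    return GP[AP.index(total)]   # raises ValueError if the sum lies beyond the table


print(multiply_by_addition(4, 128))   # 2 + 7 = 9, and 9 corresponds to 512

# An inserted term: the geometrical mean of 1 and 2 corresponds to .5 in the A.P.
print(round(math.sqrt(1 * 2), 2))
```

The need to extend the table is apparent from the error raised when a product falls outside it.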

The above series are tabular representations of the function 
$y = 2^x$, where x denotes the numbers in the A.P., and y the 
numbers in the G.P. The number 2 is called the base and x is 
said to be the logarithm of y to the base 2. The logarithm of a 
number is thus the exponent to which a fixed number, called the 
base, must be raised to equal the given number, or, if $y = b^x$, 
then x is the logarithm of y to the base b, or $x = \log_b y$. 

If 2 is the base, $\log_2 64 = 6$, because $2^6 = 64$; if 8 is 
the base, $\log_8 64 = 2$, because $8^2 = 64$. The number of 
possible bases is clearly infinitely large. 
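
The definition may be tried out in Python. The function `integer_log` below is a hypothetical helper written for this illustration; it finds a whole-number logarithm by repeated multiplication and returns `None` when the number is not an exact power of the base. The library function `math.log(y, b)` is used for the general case.

```python
import math


def integer_log(y, b):
    """Return the whole number x with b**x == y, or None if there is none."""
    x, power = 0, 1
    while power < y:
        power *= b
        x += 1
    return x if power == y else None


print(integer_log(64, 2))   # 6, because 2**6 == 64
print(integer_log(64, 8))   # 2, because 8**2 == 64
print(math.log(11.31, 2))   # close to 3.5, in agreement with the extended series
```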

The invention of logarithms by Napier stimulated an Eng- 
lishman by the name of Henry Briggs to work out a system of 
logarithms to the base 10. Between the years 1617 and 1628, 
Briggs and others completed tables of logarithms up to 100,000 
carried out to fourteen decimal places. Many other tables have 
since been computed, one of the most complete being a 20-place 
table carried out by Mr. A. J. Thompson* under the direction 
of Professor Pearson. 


* A. J. Thompson, Logarithmetica Britannica, being a Standard Table of Loga- 
rithms to Twenty Decimal Places. Cambridge University Press, London, 1924. 
The present chapter will deal entirely with the Briggs 
logarithms with a base of 10. Before considering their use, 
however, a brief review of the laws of exponents will be given. 


4. LAWS OF EXPONENTS 


The symbol $a^n$ is used to represent the product of a to n equal 
factors, or $a^n = a \cdot a \cdot a \cdots$ to n factors, where n 
is a positive integral (whole) exponent. Certain fundamental laws 
for such exponents may now be given as follows: 

I. $(a^m)(a^n) = a^{m+n}$; for example, $(10^2)(10^4) = 10^{2+4} = 10^6$. 
This follows at once from the fact that 
$(a^m)(a^n) = (a \cdot a \cdot a \cdots$ to m factors$)(a \cdot a \cdot a \cdots$ to n factors$)$ 
$= a \cdot a \cdot a \cdot a \cdots$ to (m + n) factors. 
The other laws are proved in a similar way. 

II. $\frac{a^m}{a^n} = a^{m-n}$; for example, $\frac{3^6}{3^4} = 3^{6-4} = 3^2$. 

III. $(a^m)^n = a^{mn}$; for example, $(20^3)^2 = 20^6$. 

IV. $(ab)^n = a^n b^n$; for example, $(3 \times 4)^2 = 3^2 \times 4^2$. 

V. $\left(\frac{a}{b}\right)^n = \frac{a^n}{b^n}$; for example, $\left(\frac{2}{3}\right)^2 = \frac{2^2}{3^2}$. 

The above laws also hold when the exponents are any positive or 
negative integral or fractional numbers. Fractional and negative 
exponents are defined as follows: 

$a^{m/n} = \sqrt[n]{a^m}$; for example, $8^{2/3} = \sqrt[3]{8^2} = 4$. 

$a^{-n} = \frac{1}{a^n}$; for example, $16^{-2} = \frac{1}{16^2} = \frac{1}{256}$. 

If $a^{-n} = \frac{1}{a^n}$ it follows that $a^0 = a^{n-n} = \frac{a^n}{a^n} = 1$. 

Thus any number (other than zero) to the zero power is equal to 
one. This is an important law and should be remembered. 
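
The laws of exponents may be spot-checked with exact arithmetic. The Python sketch below uses `fractions.Fraction` so that no rounding enters for whole-number exponents; the function name `exponent_laws_hold` and the particular bases and exponents are assumptions chosen for illustration, and a fractional exponent is checked separately in floating point.

```python
import math
from fractions import Fraction


def exponent_laws_hold(a, b, m, n):
    """Check the product, quotient, power, and distributive laws for nonzero a, b
    and whole-number exponents m, n (positive, zero, or negative)."""
    a, b = Fraction(a), Fraction(b)
    return all([
        a ** m * a ** n == a ** (m + n),
        a ** m / a ** n == a ** (m - n),
        (a ** m) ** n == a ** (m * n),
        (a * b) ** n == a ** n * b ** n,
        (a / b) ** n == a ** n / b ** n,
    ])


print(exponent_laws_hold(10, 3, 2, 4))
print(exponent_laws_hold(3, 4, 6, -4))
print(Fraction(16) ** -2)                 # 1/256
print(Fraction(7) ** 0)                   # 1
print(math.isclose(8 ** (2 / 3), 4.0))    # fractional exponent, floating point
```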
Some further illustrations of the above laws are as follows: 
1Gi ae Det 1 alto 
16? VIE 2 
(272)8 = V7)? 
(4 x 9)-? = 36— 


4 


L? 
2 


5. LAWS OF LOGARITHMS 


From the definition of a logarithm and the laws of exponents, 
the basic principles for logarithmic computation may be ex- 
pressed as follows: 

I. The logarithm of a product is equal to the sum of the 
logarithms of the factors, or 

    $\log_b MN = \log_b M + \log_b N.$ 

PROOF. Let $x = \log_b M$ and $y = \log_b N$. Then $b^x = M$ and 
$b^y = N$ (from the definition in section 3) and $MN = b^{x+y}$, or 
$\log_b MN = x + y = \log_b M + \log_b N$. The proofs for the 
remaining laws are similar. 

II. The logarithm of a quotient is equal to the logarithm of the 
dividend minus the logarithm of the divisor, or 

    $\log_b \frac{M}{N} = \log_b M - \log_b N.$ 

III. The logarithm of the nth power of a number is n times the 
logarithm of the number, or 

    $\log_b M^n = n \log_b M.$ 

IV. The logarithm of the nth root of a number is one-nth of the 
logarithm of the number, or 

    $\log_b \sqrt[n]{M} = \frac{1}{n} \log_b M.$ 
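
The four laws may be checked numerically for any positive numbers. The Python sketch below compares the two sides of each law within a small tolerance, since the logarithms are computed in floating point; the function name `log_laws_hold` and the sample numbers are illustrative assumptions.

```python
import math


def log_laws_hold(M, N, n, base=10.0):
    """Compare both sides of laws I to IV for positive M, N and a positive integer n."""
    def log(value):
        return math.log(value, base)

    pairs = [
        (log(M * N), log(M) + log(N)),        # I.   product
        (log(M / N), log(M) - log(N)),        # II.  quotient
        (log(M ** n), n * log(M)),            # III. power
        (log(M ** (1 / n)), log(M) / n),      # IV.  root
    ]
    return all(math.isclose(left, right, rel_tol=1e-9, abs_tol=1e-12)
               for left, right in pairs)


print(log_laws_hold(43.2, 6.37, 3))
print(log_laws_hold(8, 2, 5, base=2))
```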




6. THE BRIGGS SYSTEM OF LOGARITHMS 


Returning to the Briggs system of logarithms we may note 
that the logarithm of any number N to the base 10 is the ex- 
ponent x to which 10 must be raised to produce the number N. 
Thus, if $x = \log_{10} N$, 
then $10^x = N$. 

Inasmuch as 10 is always the base here considered we may here- 
after write more briefly, 
x = log N. 

From the above definition we may write down the logarithms 

of certain numbers at once as shown in Table 9. 


TABLE 9. SHOWING THE LOGARITHMS OF NUMBERS WHICH ARE 
MULTIPLES OF 10 


NUMBER        LOGARITHM    AUTHORITY 
100,000.          5        10^5  = 100,000. 
 10,000.          4        10^4  = 10,000. 
  1,000.          3        10^3  = 1,000. 
    100.          2        10^2  = 100. 
     10.          1        10^1  = 10. 
      1.          0        10^0  = 1. 
       .1        -1        10^-1 = .1 
       .01       -2        10^-2 = .01 
       .001      -3        10^-3 = .001 
       .0001     -4        10^-4 = .0001 
       .00001    -5        10^-5 = .00001 


The logarithm of a number between 100 and 1000 will evi- 
dently be somewhere between 2 and 3, that is, some fractional 
exponent. The number having the logarithm 2.5, for example, 
may be found by taking the geometric mean of 100 and 1000, or 
$\sqrt{100{,}000} = 316.2$. We may then write log 316.2 = 2.5. 

It is evident that logarithms consist of an integral and a 
decimal part, the former being called the characteristic and the 
latter the mantissa of the logarithm. Thus for the logarithm of 
316.2 the characteristic is 2 and the mantissa is .5. 
Very complete tables of mantissas have been computed as 
described above and conveniently tabled for use. The charac- 
teristic, it will be noted, may always be obtained by inspection. 

In order to illustrate the procedure in finding the complete
logarithm a four-place table of mantissas is given in Table 10.
Let the logarithm of 43.2 be required. Since this number lies
between 10 and 100 its logarithm will be between 1 and 2 and
hence the characteristic is 1.

The mantissa, or decimal part, is found by looking down the
column under N for the figures 43 and then proceeding to the
right until the column headed 2 is reached. The number found
is 6355. The decimal points have been omitted in the table, so
that the complete logarithm is 1 + .6355, or log 43.2 = 1.6355.

If the number had been 4.32, the characteristic would have
been zero and the mantissa the same as before. Therefore,
log 4.32 = 0.6355. This result is evident from the laws of
exponents, for if $10^{1.6355} = 43.2$,

then $10^{1.6355} \div 10 = 4.32$,
or $10^{0.6355} = 4.32$.


For the logarithm of .432, the characteristic will be — 1 and 
the mantissa will again be equal to + .6355. Instead of adding 
these two values directly, however, it is found more convenient 
to keep the mantissa positive and write 


log .432 = 9.6355 - 10,


by adding and subtracting 10 from the characteristic. 

The general rule for determining the characteristic of a
logarithm may now be stated as follows: The characteristic of a
number greater than 1 is one less than the number of digits to the
left of the decimal point; while the characteristic for a number less
than 1 is negative and one greater (numerically) than the number
of zeros between the decimal point and the first significant figure.
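The rule for the characteristic, together with the convention of writing negative characteristics as "9.xxxx - 10", can be sketched as follows. The function names `characteristic` and `table_form` are hypothetical, and the mantissa is computed with `math.log10` rather than read from Table 10, so this is an illustration of the rule and not a substitute for the table.

```python
import math

def characteristic(x):
    """Characteristic of log10(x) for x > 0: the integral part, floor(log10 x)."""
    return math.floor(math.log10(x))

def table_form(x, places=4):
    """Write log10(x) with a positive mantissa, using the '- 10' convention."""
    c = characteristic(x)
    mantissa = round(math.log10(x) - c, places)  # always between 0 and 1
    if c >= 0:
        return f"{c + mantissa:.{places}f}"
    # Negative characteristic: add 10 to it and subtract 10 afterwards.
    return f"{c + 10 + mantissa:.{places}f} - 10"

for number in (43.2, 4.32, 0.432, 0.00004):
    print(number, table_form(number))
```

A mantissa that rounds up to 1.0000 (a number just below a power of 10) is not handled specially in this sketch.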

In looking up the mantissa of a number the rule is to neglect
the decimal point and find the nearest mantissa for the given
sequence of digits. A more accurate method will be shown in
section 7, where linear interpolation is presented.

The logarithms of the following numbers should now be verified
by these rules and Table 10.

log 6.37 = 0.8041 log .00004 = 5.6021 — 10 

log .0637 = 8.8041 — 10 log 1910 = 3.2810 

log .00637 = 7.8041 — 10 log 20000 = 4.3010 

log 1.01 = 0.0043 log 2 = 0.3010 

log .001 = 7.0000 — 10 log .999 = 9.9996 — 10 


A few short calculations may now be illustrated by the use of
logarithms. Let the product 6.37 × 1910 be required. By the
first law of the preceding section,


log (6.37 × 1910) = log 6.37 + log 1910
                  = 0.8041 + 3.2810 = 4.0851.


The number corresponding to the logarithm 4.0851 is clearly 
between 10,000 and 100,000, and the sequence of the digits is 
determined by the mantissa .0851. The nearest mantissa in 
Table 10 is .0864, corresponding to the number 122, so that 
the required product to three figures is 12,200. By direct mul- 
tiplication the product is 12,166.70. 

The steps in the above calculation were as follows: 

1. Finding the logarithms of the factors (.8041 and 3.2810), 

2. Adding these logarithms (4.0851), 

3. Looking for the number (N) corresponding to the mantissa 
of the sum of the logarithms (122 corresponds to .0864), and 

4. Determining the number of places in the result by noting
the characteristic of the sum of the logarithms (characteristic
4 gives five digits before decimal point), and supplying zeros for
the missing digits. (Answer is 12,200.)
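The four steps above can be sketched in code. The helpers `log4` and `antilog_3fig` are hypothetical stand-ins for reading Table 10: they round `math.log10` to four places and keep three figures in the result, which for this example agrees with taking the nearest tabled mantissa.

```python
import math

def log4(x):
    """Common logarithm rounded to four places, as a four-place table would give."""
    return round(math.log10(x), 4)

def antilog_3fig(log_value):
    """Number corresponding to a logarithm, kept to three significant figures."""
    c = math.floor(log_value)                  # characteristic fixes the decimal point
    digits = round(10 ** (log_value - c), 2)   # mantissa fixes the sequence of digits
    return digits * 10 ** c

total = log4(6.37) + log4(1910)   # steps 1 and 2: find and add the logarithms
product = antilog_3fig(total)     # steps 3 and 4: digits, then number of places
print(round(total, 4), round(product))
```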

Next let the quotient .0437/6920 be required. By the second law
of logarithms, log quotient = log .0437 - log 6920, or
(8.6405 - 10) - 3.8401 = 4.8004 - 10. The reason for adding and
subtracting 10 for negative characteristics now becomes
apparent, for the subtraction may be made continuous; that is,
on reaching the decimal point, 1 may be borrowed from the 8,
which is positive. The difference is therefore 4.8004 - 10.
Looking in the table for the mantissa nearest .8004 we find
.8007, which corresponds to the sequence of digits 632. The
characteristic 4 - 10, or -6, shows that five zeros must follow
between the decimal point and the first significant figure in
the number. The required quotient is therefore .00000632. By
arithmetical calculation we obtain .000006315+.

The great convenience of logarithms is shown especially in
raising a number to a given power. If $(.642)^6$ be required, the
third law of logarithms may be applied, and we find that


$\log (.642)^6 = 6 \log .642 = 6(9.8075 - 10)$
$= 58.8450 - 60$
$= 8.8450 - 10.$


The nearest mantissa is .8451 for N = 700, and the
characteristic is -2. The answer is therefore .0700. By
multiplying out (.642)(.642) ... to six factors we obtain .07002.


By applying the fourth law, $\sqrt[10]{.777}$ may be found as follows:


$\log \sqrt[10]{.777} = \frac{1}{10} \log .777 = \frac{1}{10}(9.8904 - 10)$
$= \frac{1}{10}(99.8904 - 100)$
$= 9.98904 - 10.$


The required root is therefore .975.
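The quotient, power, and root worked above can be reproduced in a few lines. As before, `log4` is a hypothetical stand-in for a four-place table (it rounds `math.log10`), and negative logarithms are handled directly instead of through the "- 10" device.

```python
import math

def log4(x):
    """Common logarithm rounded to four places, as a four-place table would give."""
    return round(math.log10(x), 4)

# Second law: subtract logarithms to divide.
quotient = 10 ** (log4(0.0437) - log4(6920))
# Third law: multiply the logarithm by the exponent.
power = 10 ** (6 * log4(0.642))
# Fourth law: divide the logarithm by the index of the root.
root = 10 ** (log4(0.777) / 10)

print(f"{quotient:.3g}  {power:.4f}  {root:.3f}")
```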


7. INTERPOLATION 


A graph of the logarithm function $y = \log_{10} N$ may be made
by plotting a few of the values from Table 10. (See Fig. 18.)

The logarithm of 7 is given by the ordinate .8451, while the
logarithm of 8 is represented by y = .9031. If the logarithms
between 7 and 8 were unknown, an approximation to the
logarithm of 7.5 could be obtained by assuming that the function
is a straight line over this interval and taking the ordinate at
7.5 as the required logarithm. Graphically, this amounts to
measuring the ordinate PQ shown in Fig. 18. Arithmetically, the
procedure is to take half the sum of the logarithms of 7 and 8,
or .8741. Reference to the table, however, gives log 7.5 = .8751,
so that there is an error of .001 in this case.

The above method is known as linear interpolation and is
extremely useful in case the interval over which the
interpolation is carried is small. In such cases the function
will be so nearly linear that only a slight error will result.
Values of the function between those given in the table may be
found, and hence a greater degree of accuracy may be obtained
than in the tabled entries.

[Fig. 18. Graph of the logarithm function]

Thus, if the logarithm of 7.637 be required, the logarithms of
7.63 and 7.64 may be found in Table 10 and the extra amount for
.007 found by interpolation. An enlarged portion of the graph is
shown in Fig. 19. The difference between the logarithm for 7.64
and 7.63 is .0006 and is known as the tabular difference. From
similar triangles it is now apparent that
$\frac{c}{.0006} = \frac{.007}{.01}$, or c = .7(.0006) = .0004, where c is
the correction to be added to the lower tabular value. The
logarithm of 7.637 is therefore .8825 + .0004 = .8829. (From a
seven-place table the logarithm is .8829228.)
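The interpolation just described amounts to a single proportion. The following sketch uses a hypothetical function `interpolate`, with the tabular values quoted in the text supplied by hand.

```python
def interpolate(x, x0, y0, x1, y1):
    """Linear interpolation: value at x on the straight line through (x0, y0), (x1, y1)."""
    return y0 + (x - x0) * (y1 - y0) / (x1 - x0)

# log 7.637 from the tabled logarithms of 7.63 and 7.64.
log_7637 = interpolate(7.637, 7.63, 0.8825, 7.64, 0.8831)

# log 7.5 from the logarithms of 7 and 8: a wide interval, hence a larger error.
log_75 = interpolate(7.5, 7, 0.8451, 8, 0.9031)

print(round(log_7637, 4), round(log_75, 4))
```

The second call shows why the method is recommended only for small intervals: over the whole interval from 7 to 8 the result differs from the tabled .8751 in the third decimal place.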

The labor of computing each correction is saved by using a
table of proportional parts shown at the right of the main
figures in Table 10. In finding the logarithm of 7.637, for
example, it is only necessary to look up the logarithm of 7.63,
move out along this line to the value 7 at the top or bottom of
the proportional parts, and read off the entry 4. This last
result is to be added to the fourth place of .8825, giving .8829
as before. In a similar way, the logarithms of 6.349 and .04233
are 0.8027 and 8.6266 - 10, respectively.

The table of proportional parts is also useful in looking up
the number corresponding to a given logarithm. This may
be illustrated by the following problem:

Find the product of .7437 and 3.242.


log .7437 =  9.8714 - 10
log 3.242 =  0.5108
log prod. = 10.3822 - 10


The nearest mantissa smaller than .3822 is .3820, which
corresponds to the number 241. The difference .0002 is now
found in the proportional parts on the same line and by moving
up to the top is found to correspond to 1. This last result
should be adjoined to the three figures already found, giving
as the required number 2.411. The whole procedure will become
clearer if the logarithm of 2.411 is now worked out as shown in
the paragraph above.

[Fig. 19. Showing linear interpolation for log 7.637]

The method of linear interpolation will be sufficiently accu- 
rate for small differences in logarithms and similar functions 
where one (and possibly two) places beyond those given in the 
table are required. Thus with Table 10 linear interpolation is 
adequate for the logarithms of four-place numbers, and with 
a five-place table such as Taylor’s* similar interpolation gives 
logarithms of five-place numbers. 

* Taylor, Five-Place Logarithmic and Trigonometric Tables. Ginn and
Company. This table is especially recommended on account of its
excellent physical make-up and the thumb index with which it is
provided.

More exact methods of interpolation are often required in
advanced statistical work, but the formulas become quite
complicated and are used so seldom in elementary work that they
are omitted here. For a clear account the student should
consult Forsyth,* and for more advanced treatment an excellent
work by Whittaker and Robinson.†


8. SOME ADDITIONAL PROBLEMS 


It should be noted that the operations of addition and
subtraction of numbers cannot be carried out by logarithms.
Thus, if the problem to be worked out is

$\frac{(6.748)(89.24) - (36.5)}{475},$

this must be broken up into two parts which are worked
separately by logarithms and combined only when the final
answers are obtained. The work will then be as follows:


log 6.748 = 0.8289          log 36.5  = 11.5623 - 10
log 89.24 = 1.9506          log 475   =  2.6767
log prod. = 2.7795          log quot. =  8.8856 - 10
log 475   = 2.6767          ∴ quot.   = .07684
log quot. = 0.1028
∴ quot.   = 1.267


The required answer is therefore 1.267 - .077 = 1.190. As we
shall see in the next chapter, such a result should not be
carried beyond four figures.

In subtracting the logarithm of 475 from log 36.5 it will be 
noted that 10 has been added to and subtracted from the char- 
acteristic of the latter in order to facilitate the final subtraction 
of the logarithms. 
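The need to split the expression into two quotients can be illustrated as follows. The helper `log4` is a hypothetical stand-in for a four-place table; because it rounds each logarithm independently, its last figure may differ slightly from values read from Table 10.

```python
import math

def log4(x):
    """Common logarithm rounded to four places, as a four-place table would give."""
    return round(math.log10(x), 4)

# Each quotient is found by logarithms; the subtraction is ordinary arithmetic.
first = 10 ** (log4(6.748) + log4(89.24) - log4(475))   # (6.748)(89.24)/475
second = 10 ** (log4(36.5) - log4(475))                 # 36.5/475
answer = first - second

exact = (6.748 * 89.24 - 36.5) / 475                    # direct check
print(round(first, 3), round(second, 4), round(answer, 3), round(exact, 3))
```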

A typical problem that occurs in statistical calculation is of
the form


$\text{S.D.} = \sqrt{\frac{S}{N} - c^2}\; h$; for example, $\sqrt{\frac{3483}{794} - \left(\frac{37}{794}\right)^2} \cdot 5$.
* Forsyth, Mathematical Analysis of Statistics, chap. iii. Wiley, 1924.
† Whittaker and Robinson, The Calculus of Observations. D. Van Nostrand
Company, 1924.

TABLE 10. FOUR-PLACE LOGARITHMS OF NUMBERS


 N |    0    1    2    3    4    5    6    7    8    9

10 | 0000 0043 0086 0128 0170 0212 0253 0294 0334 0374
11 | 0414 0453 0492 0531 0569 0607 0645 0682 0719 0755
12 | 0792 0828 0864 0899 0934 0969 1004 1038 1072 1106
13 | 1139 1173 1206 1239 1271 1303 1335 1367 1399 1430
14 | 1461 1492 1523 1553 1584 1614 1644 1673 1703 1732

15 | 1761 1790 1818 1847 1875 1903 1931 1959 1987 2014
16 | 2041 2068 2095 2122 2148 2175 2201 2227 2253 2279
17 | 2304 2330 2355 2380 2405 2430 2455 2480 2504 2529
18 | 2553 2577 2601 2625 2648 2672 2695 2718 2742 2765
19 | 2788 2810 2833 2856 2878 2900 2923 2945 2967 2989

20 | 3010 3032 3054 3075 3096 3118 3139 3160 3181 3201
21 | 3222 3243 3263 3284 3304 3324 3345 3365 3385 3404
22 | 3424 3444 3464 3483 3502 3522 3541 3560 3579 3598
23 | 3617 3636 3655 3674 3692 3711 3729 3747 3766 3784
24 | 3802 3820 3838 3856 3874 3892 3909 3927 3945 3962

25 | 3979 3997 4014 4031 4048 4065 4082 4099 4116 4133
26 | 4150 4166 4183 4200 4216 4232 4249 4265 4281 4298
27 | 4314 4330 4346 4362 4378 4393 4409 4425 4440 4456
28 | 4472 4487 4502 4518 4533 4548 4564 4579 4594 4609
29 | 4624 4639 4654 4669 4683 4698 4713 4728 4742 4757

30 | 4771 4786 4800 4814 4829 4843 4857 4871 4886 4900
31 | 4914 4928 4942 4955 4969 4983 4997 5011 5024 5038
32 | 5051 5065 5079 5092 5105 5119 5132 5145 5159 5172
33 | 5185 5198 5211 5224 5237 5250 5263 5276 5289 5302
34 | 5315 5328 5340 5353 5366 5378 5391 5403 5416 5428

35 | 5441 5453 5465 5478 5490 5502 5514 5527 5539 5551
36 | 5563 5575 5587 5599 5611 5623 5635 5647 5658 5670
37 | 5682 5694 5705 5717 5729 5740 5752 5763 5775 5786
38 | 5798 5809 5821 5832 5843 5855 5866 5877 5888 5899
39 | 5911 5922 5933 5944 5955 5966 5977 5988 5999 6010

40 | 6021 6031 6042 6053 6064 6075 6085 6096 6107 6117
41 | 6128 6138 6149 6160 6170 6180 6191 6201 6212 6222
42 | 6232 6243 6253 6263 6274 6284 6294 6304 6314 6325
43 | 6335 6345 6355 6365 6375 6385 6395 6405 6415 6425
44 | 6435 6444 6454 6464 6474 6484 6493 6503 6513 6522

45 | 6532 6542 6551 6561 6571 6580 6590 6599 6609 6618
46 | 6628 6637 6646 6656 6665 6675 6684 6693 6702 6712
47 | 6721 6730 6739 6749 6758 6767 6776 6785 6794 6803
48 | 6812 6821 6830 6839 6848 6857 6866 6875 6884 6893
49 | 6902 6911 6920 6928 6937 6946 6955 6964 6972 6981

50 | 6990 6998 7007 7016 7024 7033 7042 7050 7059 7067
51 | 7076 7084 7093 7101 7110 7118 7126 7135 7143 7152
52 | 7160 7168 7177 7185 7193 7202 7210 7218 7226 7235
53 | 7243 7251 7259 7267 7275 7284 7292 7300 7308 7316
54 | 7324 7332 7340 7348 7356 7364 7372 7380 7388 7396

[Proportional parts columns not reproduced.]

The proportional parts are stated in full for every tenth at the
right-hand side. The logarithm of any number of four significant
figures can be read directly by adding the proportional part
corresponding to the fourth figure to the tabular number
corresponding to the first three figures.

TABLE 10. FOUR-PLACE LOGARITHMS OF NUMBERS (CONTINUED)


 N |    0    1    2    3    4    5    6    7    8    9

55 | 7404 7412 7419 7427 7435 7443 7451 7459 7466 7474
56 | 7482 7490 7497 7505 7513 7520 7528 7536 7543 7551
57 | 7559 7566 7574 7582 7589 7597 7604 7612 7619 7627
58 | 7634 7642 7649 7657 7664 7672 7679 7686 7694 7701
59 | 7709 7716 7723 7731 7738 7745 7752 7760 7767 7774

60 | 7782 7789 7796 7803 7810 7818 7825 7832 7839 7846
61 | 7853 7860 7868 7875 7882 7889 7896 7903 7910 7917
62 | 7924 7931 7938 7945 7952 7959 7966 7973 7980 7987
63 | 7993 8000 8007 8014 8021 8028 8035 8041 8048 8055
64 | 8062 8069 8075 8082 8089 8096 8102 8109 8116 8122

65 | 8129 8136 8142 8149 8156 8162 8169 8176 8182 8189
66 | 8195 8202 8209 8215 8222 8228 8235 8241 8248 8254
67 | 8261 8267 8274 8280 8287 8293 8299 8306 8312 8319
68 | 8325 8331 8338 8344 8351 8357 8363 8370 8376 8382
69 | 8388 8395 8401 8407 8414 8420 8426 8432 8439 8445

70 | 8451 8457 8463 8470 8476 8482 8488 8494 8500 8506
71 | 8513 8519 8525 8531 8537 8543 8549 8555 8561 8567
72 | 8573 8579 8585 8591 8597 8603 8609 8615 8621 8627
73 | 8633 8639 8645 8651 8657 8663 8669 8675 8681 8686
74 | 8692 8698 8704 8710 8716 8722 8727 8733 8739 8745

75 | 8751 8756 8762 8768 8774 8779 8785 8791 8797 8802
76 | 8808 8814 8820 8825 8831 8837 8842 8848 8854 8859
77 | 8865 8871 8876 8882 8887 8893 8899 8904 8910 8915
78 | 8921 8927 8932 8938 8943 8949 8954 8960 8965 8971
79 | 8976 8982 8987 8993 8998 9004 9009 9015 9020 9025

80 | 9031 9036 9042 9047 9053 9058 9063 9069 9074 9079
81 | 9085 9090 9096 9101 9106 9112 9117 9122 9128 9133
82 | 9138 9143 9149 9154 9159 9165 9170 9175 9180 9186
83 | 9191 9196 9201 9206 9212 9217 9222 9227 9232 9238
84 | 9243 9248 9253 9258 9263 9269 9274 9279 9284 9289

85 | 9294 9299 9304 9309 9315 9320 9325 9330 9335 9340
86 | 9345 9350 9355 9360 9365 9370 9375 9380 9385 9390
87 | 9395 9400 9405 9410 9415 9420 9425 9430 9435 9440
88 | 9445 9450 9455 9460 9465 9469 9474 9479 9484 9489
89 | 9494 9499 9504 9509 9513 9518 9523 9528 9533 9538

90 | 9542 9547 9552 9557 9562 9566 9571 9576 9581 9586
91 | 9590 9595 9600 9605 9609 9614 9619 9624 9628 9633
92 | 9638 9643 9647 9652 9657 9661 9666 9671 9675 9680
93 | 9685 9689 9694 9699 9703 9708 9713 9717 9722 9727
94 | 9731 9736 9741 9745 9750 9754 9759 9763 9768 9773

95 | 9777 9782 9786 9791 9795 9800 9805 9809 9814 9818
96 | 9823 9827 9832 9836 9841 9845 9850 9854 9859 9863
97 | 9868 9872 9877 9881 9886 9890 9894 9899 9903 9908
98 | 9912 9917 9921 9926 9930 9934 9939 9943 9948 9952
99 | 9956 9961 9965 9969 9974 9978 9983 9987 9991 9996
With a machine or even by long-hand arithmetic, this result
may be worked out quite rapidly if a good table of squares is
used. The logarithmic calculation is a little awkward on
account of the subtraction under the radical, but inasmuch as
some students find it convenient the method may be illustrated
as follows:


log 3483 = 3.5420           log 37  = 11.5682 - 10
log 794  = 2.8998           log 794 =  2.8998
log S/N  = 0.6422           log c   =  8.6684 - 10
∴ S/N    = 4.387            log c²  = 17.3368 - 20
                            ∴ c²    = .002

S/N - c²        = 4.385
log (S/N - c²)  = 0.6420
log √(S/N - c²) = 0.3210
log h           =  .6990
log S.D.        = 1.0200    ∴ S.D. = 10.47

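As a check on the logarithmic work, the same standard deviation may be computed directly. The function name `standard_deviation` is hypothetical; the symbols S, N, c, and h are used exactly as in the formula above.

```python
import math

def standard_deviation(S, N, c, h):
    """S.D. = sqrt(S/N - c^2) * h, evaluated by direct arithmetic."""
    return math.sqrt(S / N - c ** 2) * h

# The example from the text: S = 3483, N = 794, c = 37/794, h = 5.
sd = standard_deviation(3483, 794, 37 / 794, 5)
print(round(sd, 2))
```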
Another calculation may be illustrated by the formula


$r = \frac{\Sigma xy}{N \sigma_x \sigma_y}$. For example, $r = \frac{743.2}{682(2.673)(2.794)}$.


This is readily adapted to logarithmic work:


log 682   = 2.8338          log 743.2 = 12.8711 - 10
log 2.673 = 0.4270          log prod. =  3.7070
log 2.794 = 0.4462          log r     =  9.1641 - 10
log prod. = 3.7070          ∴ r       = .1459


A final problem may be worked out in the case of the geometric
mean of several quantities where


$\text{G.M.} = \sqrt[n]{X_1 X_2 \cdots X_n};$
for example, $\text{G.M.} = \sqrt[5]{27.4 \times 29.5 \times 28.3 \times 29.2 \times 29.9}.$

We therefore have:

log 27.4  = 1.4378
log 29.5  = 1.4698
log 28.3  = 1.4518
log 29.2  = 1.4654
log 29.9  = 1.4757
log prod. = 7.3005
1/5 log prod. = 1.4601

∴ G.M. = 28.8
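The geometric mean lends itself to the same treatment in code: the logarithms are averaged and the antilogarithm taken. The function name `geometric_mean` is hypothetical, and full-precision logarithms are used in place of Table 10.

```python
import math

def geometric_mean(values):
    """nth root of the product of n positive values, found by averaging logarithms."""
    mean_log = sum(math.log10(v) for v in values) / len(values)
    return 10 ** mean_log

values = [27.4, 29.5, 28.3, 29.2, 29.9]
gm = geometric_mean(values)

# Direct check: the fifth root of the product of the five values.
direct = math.prod(values) ** (1 / len(values))
print(round(gm, 2), round(direct, 2))
```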


EXERCISES 


1. Find the logarithms of the following numbers by a four-place 
table. Check your results by referring to a five-place table: 634.2, 
59.61, 1.722, .004359, .1166, .00004795, 5566., 6234000. 


2. Work out the following operations by logarithms: 


(1) 6432 x 08475 (9) S[487.1 (gy ___1_, 
6.742 : 3622.” V472 x 347 


(4) a (5) VeT x 68 x 69 x 70" 


Ans. (1) .008815, (2) .7080, (3) .00247, (4) .0001066, (5) 68.49.


3. Calculate the standard deviations with the following data,
using the formula $\text{S.D.} = \sqrt{\frac{S}{N} - c^2}\; h$.


       S      N      c        h      S.D. (Ans.)
(1)    4782   462    .0123    5      16.0
(2)    1692   192    1.1340   3      8.23
(3)    1578   641    .843     0.25   0.330


4. Compute the correlations for the data below, using the formula
$r = \frac{a}{\sqrt{bc}}$.


a b c r (Ans.)
(iQ) As 235 182 851 
(2) 2384 234 259 ODL: 
(3) 1938 291 279 677 
(4) — 64.2 173.3 1892 — .112 


(5) 831 831 831 1.000 




5. Compute the geometric means for the following series:
a. 169, 171, 165, 168, 173, 175, 170. (170.1. Ans.)
b. 33.1, 34.2, 33.4, 34.5, 33.6, 34.7, 34.8, 33.9. (34.0. Ans.)


6. Calculate $r_{12.3}$ by the following formula:


$r_{12.3} = \frac{r_{12} - r_{13}\, r_{23}}{\sqrt{1 - r_{13}^2}\,\sqrt{1 - r_{23}^2}}.$

Use Holzinger’s* Table VII for $\log \sqrt{1 - r^2}$.


12 713 «¥23~—S—«12,3 (Ans.) T12 T13 r23-Ss«12.3 (Ans.) 
CUES Zeer Gel (aes Le (4) 481 .827 .214 .391 
@) DY alle aly duos GA Ace A LIS 
(8) .80 .80 .80 .444 (6) 9382532214934 


7. Using the formula
$1 - R_{1(234)}^2 = (1 - r_{12}^2)(1 - r_{13.2}^2)(1 - r_{14.23}^2),$
work out the values of $R_{1(234)}$ with the aid of Holzinger’s Table VI.


riz 113.2 114.23 Ri(234) (Ans.) 
(GD) oral -620 AT4 .906 
(2) .883 .695 347 .928 
(S)EoS .062 — .007 -756 
(4) .815 -742 .676 .958 


* Karl J. Holzinger, Statistical Tables for Students in Education and
Psychology. The University of Chicago Press, 1925.


CHAPTER V
ERRORS IN CALCULATION AND MEASUREMENT


1. ACCURACY IN STATISTICAL METHOD


In dealing with statistical material it is desirable to recognize 
very early the importance of accuracy not only in the calculations 
which need to be performed but in the data themselves. The 
student should train himself to be accurate in his computations 
and to employ adequate checks wherever possible. He should 
also be cautious as to accuracy of the data which he is using, 
in order to safeguard against making unwarranted conclusions 
from the results obtained. 

Actual blunders in calculation can best be obviated by ex- 
treme care and adequate methods of checking all of the com- 
putations. Even with such mistakes eliminated, however, it is 
necessary to be cautious regarding the number of places to use 
in order to obtain a result to a given degree of accuracy. The 
distinction between different types of error is also important. 
For these reasons the present chapter will be devoted to some 
of the simplest principles involved in errors of calculation and 
measurement. 


2. ABSOLUTE AND RELATIVE ERRORS 


An error may be defined as the discrepancy between the
obtained and the true values from a numerical process or
measurement. If $X_1$ be an obtained value and X the true value,
the difference, $E_1 = X_1 - X$, is known as the absolute error.
The ratio of the absolute error to the true value, or $E_1/X$, is
called the relative error. For example, suppose the true value
of X is 67.5 inches, and measurements $X_1 = 66.9$ inches and
$X_2 = 69.7$ inches have been made. The two absolute errors will
be $E_1 = -0.6$ and $E_2 = +2.2$, while the corresponding relative
errors will be -.01 and +.03, or -1 per cent and +3 per cent.
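The two definitions translate directly into code. The function names below are hypothetical; the figures are those of the example just given.

```python
def absolute_error(obtained, true):
    """Absolute error: obtained value minus true value."""
    return obtained - true

def relative_error(obtained, true):
    """Relative error: absolute error divided by the true value."""
    return (obtained - true) / true

true_value = 67.5
for measurement in (66.9, 69.7):
    e = absolute_error(measurement, true_value)
    rel = relative_error(measurement, true_value)
    print(f"{measurement}: absolute {e:+.1f}, relative {rel:+.2f}")
```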

Whenever values are obtained from the measurements of some 
continuous variables such as height, they can never be exact nor 
can their true value ever be determined. All such values, includ- 
ing the errors themselves, must be approximations. The best 
that can be done is to measure to a certain degree of accuracy, 
take the average of a number of observations as an approxima- 
tion to the true value, and consider the variations from this 
result as errors. Thus suppose a stick is measured ten times to 
the nearest millimeter and the following observations are re- 
corded: 57, 58, 58, 56, 57, 60, 57, 55, 56, 56. Their average, or 
57, might be taken as the true or most typical value, and the 
variations 0, +1, +1, -1, 0, +3, 0, -2, -1, -1 would be
considered as absolute errors although they are themselves only 
approximations to the true errors. 

In case we are dealing with a discrete series such as the num- 
bers of pupils in various school grades the resulting observations 
of grade size may be considered as exact. It should be noted, 
however, that the unit of tabulation in such a series is the pupil, 
and that these units are equal to one another only in a very
limited sense, that is, as human entities.


3. BIASED AND UNBIASED ERRORS 


Errors which tend to compensate or offset one another in the 
long run are known as unbiased or compensating errors. A good 
example is furnished by the rounding off of numbers to a smaller
number of places as in Table 11. 

TABLE 11. AVERAGE DAILY SCHOOL ATTENDANCE IN THE
UNITED STATES, 1870-1910


YEAR        CHILDREN IN A.D.A.    THOUSANDS OF CHILDREN IN A.D.A.
1870             4,077,347                  4,077
1880             6,144,143                  6,144
1890             8,153,635                  8,154
1900            10,632,772                 10,633
1910            12,827,307                 12,827
Total           41,835,204                 41,835
Average          8,367,041                  8,367


In rounding off the numbers to the nearest thousand, figures
less than 500 are discarded and those greater than 500 are
considered as 1000. If figures had occurred at exactly 500, they
would have been equally divided above and below, or in case of
a single such number, 1000 would have been added. In Table 11
the “errors” in rounding were -347, -143, +365, +228, and -307,
with an algebraic total of -204. Even with so short a series the
total error was relatively small, and for a longer list of
numbers it would tend to become less because of the random
distribution of digits greater and less than five.
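The compensating behavior of the rounding errors in Table 11 can be verified with a short computation. The variable names are illustrative; the attendance figures are those of the table.

```python
attendance = [4_077_347, 6_144_143, 8_153_635, 10_632_772, 12_827_307]

# Error introduced in each item by rounding to the nearest thousand.
errors = [round(a, -3) - a for a in attendance]
total_error = sum(errors)

# The same comparison for the average of the series.
true_average = sum(attendance) / len(attendance)
rounded_average = sum(round(a, -3) for a in attendance) / len(attendance)

print(errors, total_error, round(rounded_average - true_average))
```

The error in the average is one fifth of the total error, which is why it is smaller than the error in any single item here.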

Unbiased errors are very important in the theory of averages
because their balancing effect will tend to make the average
more accurate than the original numbers. Thus, in Table 11, the
absolute error in the rounded average is only 41, which is much
less than that of any individual item. As a caution it should
be added that the above short series happened to illustrate
the above principles and was therefore chosen. In general a
longer series would be required to secure any considerable
balancing of errors.

Biased errors are those which do not tend to neutralize one
another but accumulate in such a way as to produce a relatively
large error in the total or average. Such errors are illustrated
in Fig. 20 by the marks of two teachers.

[Fig. 20. Distribution of grades of a class of pupils by a
regular teacher and by a substitute. Based on one month’s class
work and five tests. Panels: by regular teacher; by substitute
teacher.]

If biased errors are present in a series of observations, the
average will tend to be as inaccurate as the individual
measurements upon which it is based. Suppose that a meter stick
is one centimeter shorter than the standard. All measurements
with it will have a relative error of 1 per cent in the same
direction and the average will be likewise affected as
illustrated in the following table:


TABLE 12. HYPOTHETICAL MEASUREMENTS WITH CONSTANT ERROR
OF 1 PER CENT


          OBSERVED MEASUREMENT     CONSTANT ERROR
                  69                   + .69
                  71                   + .71
                  69                   + .69
                  68                   + .68
                  73                   + .73
Total            350                   +3.50
Average           70                   + .70

4. SIGNIFICANT FIGURES 


The digits in a numerical result which are known to be correct 
are called significant figures. Thus, if a measurement such as 
39.6 mm. be made, it is assumed to be correct to the nearest
tenth of a millimeter and is said to have three significant figures, 
the true value lying anywhere between 39.55 mm. and 39.65 mm. 
If the same result is expressed as .0396 meter, it is still to be con- 
sidered as correct to three figures, the zero after the decimal 
point merely serving to fill a space. When zeros occur on the 
right of a series of digits the significant figures may be shown by 
the use of a decimal point. For example, a measurement such 
as 2600. is correct to four figures or is between 2599.5 and 2600.5, 
while 2600 is correct to only two figures and lies between 2550 
and 2650. By way of further illustration, the following num- 
bers would all be considered as correct to five significant figures : 
47.234, .00036924, .0042000, 4349.0, 1000.0, 956340, 1.0000. 
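Rounding a value to a stated number of significant figures may be sketched as below. The function `round_sig` is a hypothetical helper, and the sample values are chosen for illustration only; note that the result is a number, so trailing zeros that the text treats as significant (as in 2600.) are not preserved in the output.

```python
import math

def round_sig(x, n):
    """Round x to n significant figures."""
    if x == 0:
        return 0
    # Position of the first significant figure relative to the decimal point.
    exponent = math.floor(math.log10(abs(x)))
    return round(x, n - 1 - exponent)

print(round_sig(39.6449, 3))      # three figures of a measurement in mm.
print(round_sig(0.00036924, 3))   # leading zeros are not significant
print(round_sig(2649, 2))         # two figures; lies between 2550 and 2650
```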
5. ARITHMETICAL COMPUTATION WITH ROUNDED NUMBERS


Consider the following series of products with successively
rounded values of π = 3.1415927 and e = 2.7182818, whose
product, correct to eight significant figures, is 8.5397342.


π × e                        PRODUCT               CORRECT VALUE
(3.1415927)(2.7182818) = 8.53973425942286     8.5397342
(3.141593)(2.718282)   = 8.539735703226       8.539734
(3.14159)(2.71828)     = 8.5397212652         8.53973
(3.1416)(2.7183)       = 8.53981128           8.5397
(3.142)(2.718)         = 8.539956             8.540
(3.14)(2.72)           = 8.5408               8.54
(3.1)(2.7)             = 8.37                 8.5
(3)(3)                 = 9.                   9.


The bold-faced figures are those which agree with the correct 
values on the right when the remaining digits are consolidated. 
Thus the first product is correct to seven significant figures only, 
for if rounded one place further to the right there would have 
been an error of 1 in the seventh decimal place, that is, 
8.5397343 instead of 8.5397342. Of the remaining products
only three are correct to as many significant figures as occur in 
each factor, while three others are correct to one less figure. 
The table illustrates the rule that it is not safe to carry out the 
product of two such factors beyond the number of significant 
figures included in each. 

The same principle may be illustrated in another way. Sup- 
pose that the product of 36.9 by 8.74 is required, both factors 
being correct to three significant figures. The obtained product 
is 322.506. The maximum product is 36.95 x 8.745, or 323.12775, 
while the minimum product is 36.85 x 8.735, or 321.88475. In 
this problem it is therefore doubtful whether the correct answer 
is 322 or 323. To give the result to two significant figures, as 
320, would not be desirable, for both maximum and minimum 
products exceed the value. The answer 323 is to be preferred 
because it is nearer the average of the extreme products and
therefore more probably correct than 322. 
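The extreme products can be obtained mechanically by widening each factor by half a unit in its last place. In this sketch the functions `half_unit` and `product_bounds` are hypothetical; the factors are passed as text so that the number of decimal places written is known, and the bounds as computed hold for positive factors only.

```python
from decimal import Decimal

def half_unit(text):
    """Half a unit in the last decimal place of a number written as text."""
    exponent = Decimal(text).as_tuple().exponent   # e.g. -1 for "36.9"
    return Decimal(5).scaleb(exponent - 1)         # e.g. 0.05

def product_bounds(a_text, b_text):
    """Minimum and maximum products of two rounded positive factors."""
    a, b = Decimal(a_text), Decimal(b_text)
    ha, hb = half_unit(a_text), half_unit(b_text)
    return (a - ha) * (b - hb), (a + ha) * (b + hb)

low, high = product_bounds("36.9", "8.74")
print(low, high, (low + high) / 2)
```

`Decimal` is used here so that the bounds come out exactly as in the hand computation, without binary rounding in the last places.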
When several factors are involved the rounding errors will
tend to offset one another, but in case they do not, the error in
the product may be relatively great, as illustrated by the
following example:


Rounded value . . . 34   × .129  × 6.784  = 29.754624 = 30
Maximum value . . . 34.5 × .1295 × 6.7845 = 30.311450 = 30
Minimum value . . . 33.5 × .1285 × 6.7835 = 29.201272 = 29


The best answer for this particular problem is probably 30, 
but if a dozen more items were included the product might be 
given to one more significant figure than in the item with the 
least significant figures. 

In the case of division, similar reasoning may be applied.
Consider, for example, the quotient 8.47 ÷ 23 = .368. The
maximum and minimum values are 8.475 ÷ 22.5 = .377 and
8.465 ÷ 23.5 = .360, respectively. Here there is such wide
variation in the third figure that it could not be safely
retained in the answer. The quotient .37 is probably best as an
average between the extreme values .38 and .36.

The general rule, then, is that a product or a quotient with 
rounded numbers should not be written to more significant figures 
than occur in the item with the least significant figures. 

For addition and subtraction it may also be shown that the 
result should not be written to more decimal places than occur 
in the least accurate measurement. Thus the sum of 624.2 feet 
and 49.173 feet should be written 624.2 + 49.2 = 673.4 feet, and 
the difference as 624.2 — 49.2 = 575.0 feet. By the same reason- 
ing as is applied in multiplication, it is evident that the maxi- 
mum and minimum sums are 673.4235 and 673.3225, whereas 
the corresponding differences are 575.0775 and 574.9765. 

When several items are added it is probably best to round
them at once to the number of decimal places in the least
accurate measurement. If compensating errors occur, they will
offset one another in the rounding of the items just as well as
if more places had been retained and the sum rounded. The
following problem illustrates this principle.

HYPOTHETICAL PROBLEM ILLUSTRATING ROUNDING IN SUMS


         ORIGINAL ITEMS    ROUNDED TO TWO     ROUNDED TO ONE
                           DECIMAL PLACES     DECIMAL PLACE

              67.432            67.43              67.4
               9.64              9.64               9.6
              10.4              10.4               10.4
               8.356             8.36               8.4
              17.9              17.9               17.9
               6.566             6.57               6.6
               8.327             8.33               8.3
               7.463             7.46               7.5
              29.638            29.64              29.6
              19.784            19.78              19.8
Total        185.506           185.51             185.5


It is readily verified that the maximum and minimum sums 
are 185.6145 and 185.3975. The answer 185.5 is therefore the 
best, and it may be obtained as well from the last column of 
figures as from the second where the items have been carried to 
two decimal places and the sum rounded to one. The second 
decimal place is very doubtful in any event, and it is hardly 
worth while to retain two places in the items in an attempt to 
secure a better approximation to the correct result than could 
be obtained by rounding off to one place. 
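The comparison of the three columns can be reproduced as follows. The function `rounded_sum` is a hypothetical helper; `Decimal` with half-up rounding is an assumption made so that the items round as they would by hand.

```python
from decimal import Decimal, ROUND_HALF_UP

items = ["67.432", "9.64", "10.4", "8.356", "17.9",
         "6.566", "8.327", "7.463", "29.638", "19.784"]

def rounded_sum(items, places):
    """Round every item to the given number of decimal places, then add."""
    unit = Decimal(1).scaleb(-places)            # 0.01 for two places, 0.1 for one
    return sum(Decimal(i).quantize(unit, rounding=ROUND_HALF_UP) for i in items)

exact = sum(Decimal(i) for i in items)
print(exact, rounded_sum(items, 2), rounded_sum(items, 1))
```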


6. LOGARITHMIC COMPUTATION WITH ROUNDED NUMBERS 


Inasmuch as a good share of the students’ calculations may 
be performed with the aid of logarithms it may be well to dis- 
cuss briefly their use with rounded numbers. As an illustration 
let the product 3.47 x 8.96 be required. The maximum and 
minimum factors f1 and f2, their logarithms, and the resulting
products may be set down as follows:


             f1       f2      LOG f1      LOG f2     LOG f1f2      f1f2

Maximum    3.475    8.965    .5409548    .9525503    1.4935051    31.153
Actual     3.47     8.96     .5403295    .9523080    1.4926375    31.091
Minimum    3.465    8.955    .5397032    .9520656    1.4917688    31.029


The best or most probable answer is 31.1, which is the product
of 3.47 and 8.96 carried to three significant figures, and in order 
to obtain it four-place logarithms are as satisfactory as the 
seven-place. The abbreviated computation would then be 


log 3.47  =  .5403
log 8.96  =  .9523
log prod. = 1.4926

∴ prod. = 31.1.
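
The claim that four-place logarithms serve as well as seven-place ones for this product can be tried out directly. This is a small Python sketch, assuming `math.log10` stands in for the printed tables and that rounding its result imitates reading a table of the given number of places; `product_via_logs` is a hypothetical helper name.

```python
import math

def product_via_logs(f1, f2, places):
    """Multiply by adding logarithms first rounded to `places` decimals."""
    log_sum = round(math.log10(f1), places) + round(math.log10(f2), places)
    return 10 ** log_sum

four_place = product_via_logs(3.47, 8.96, 4)
seven_place = product_via_logs(3.47, 8.96, 7)

# Limits of the true product when each factor may be off by half a unit.
largest = 3.475 * 8.965
smallest = 3.465 * 8.955

print(round(four_place, 1), round(seven_place, 1))  # 31.1 31.1
print(round(smallest, 3), round(largest, 3))        # 31.029 31.153
```

Both tables lead to 31.1, and the doubt carried by the factors themselves (31.029 to 31.153) is far wider than anything lost by dropping the extra logarithm places.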


When four significant figures are involved, four-place loga- 
rithms may be employed, but a five-place table is much more 
convenient because no interpolation is necessary if the entries 
for N are given to four places. For example, the computation 
of the product 123.7 by 96.45 may be done in either of the fol- 
lowing ways: 


WITH A FOUR-PLACE TABLE            WITH A FIVE-PLACE TABLE
AND INTERPOLATION                  (NO INTERPOLATION)

log 123.7 = 2.0923                 log 123.7 = 2.09237
log 96.45 = 1.9843                 log 96.45 = 1.98430
log prod. = 4.0766                 log prod. = 4.07667
∴ prod.   = 11,930                 ∴ prod.   = 11,930


With a product such as 34.79 by 7643.29, the second factor 
should be rounded to 7643 or 7643.3 and a five-place table
of logarithms employed. 

The general rule that will apply also in the case of division is 
that when n is the least number of figures to which any of the items 
is correct, an n or at most an n+ 1 place logarithm table should 
be used. 

In logarithmic calculation involving formulas the same general 
rule may be followed. Thus in the case of the functions $1 - r^2$
and $\sqrt{1 - r^2}$, which occur very frequently, three-place,
four-place, or five-place logarithm tables will be ample when the
values of r are given to two, three, and four places, respectively. 
The following calculation illustrates the variations which may 
occur in the numbers and logarithms:


          r                     1 - r²                  LOG (1 - r²)*
 Value   Min.   Error     Value    Max.    Error     Value     Max.     Error

  .18    .175   .005      .9676   .9694    .0018    9.9857    9.9865    .0008
  .50    .495   .005      .7500   .7550    .0050    9.8751    9.8779    .0028
  .82    .815   .005      .3276   .3358    .0082    9.5153    9.5261    .0108
  .99    .985   .005      .0199   .0298    .0099    8.2989    8.4739    .1750


It will be noted that when the value of r is given correct to two 
places, the logarithm of $(1 - r^2)$ may have an error in the first,
second, or third place, etc., depending upon the size of r.
Three-place logarithms of $(1 - r^2)$ would therefore be sufficient for
such problems. 
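
The growth of the logarithmic error with r can be seen by recomputing the table. The sketch below is an illustration only, assuming r may be too large by as much as half a unit in the second place; `rounding_errors` is a hypothetical name. Because it does not round the intermediate values to four places as the table does, its figures may differ from the printed ones by a unit in the last place.

```python
import math

def rounding_errors(r, half_unit=0.005):
    """Largest error in 1 - r^2, and in its logarithm, if r is high by half a unit."""
    r_min = r - half_unit
    value = 1 - r * r
    value_max = 1 - r_min * r_min
    return value_max - value, math.log10(value_max) - math.log10(value)

for r in (0.18, 0.50, 0.82, 0.99):
    value_error, log_error = rounding_errors(r)
    print(f"r = {r:.2f}   error in 1 - r^2 = {value_error:.4f}   "
          f"error in log = {log_error:.4f}")
```

For r = .99 the error reaches the first decimal place of the logarithm, while for r = .18 it does not appear until the fourth.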

In a similar way it may be shown that while a product such 
as $[1 - (.856)^2][1 - (.943)^2]$ may have a rounding error of only
.00035 there may be an error of .005 in its logarithm, due to
an addition of the errors in the two factors, as shown below.


$$[1 - (.8565)^2][1 - (.9435)^2] = .02925,$$
$$[1 - (.856)^2][1 - (.943)^2] = .02960,$$
$$[1 - (.8555)^2][1 - (.9425)^2] = .02995.$$
Maximum rounding error = .00035.


log [1 - (.856)²] = 9.42694 - 10      log [1 - (.8555)²] = 9.42833 - 10
log [1 - (.943)²] = 9.04435 - 10      log [1 - (.9425)²] = 9.04803 - 10
log prod.         = 8.47129 - 10      log prod.          = 8.47636 - 10

Maximum error in log prod. = .00507.


The product to be chosen is surely right when written to one 
significant figure, as .03, but it might be correct to three sig- 
nificant figures, as .0296, on account of compensating errors. 
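
The two maximum errors in this illustration can be confirmed with a few lines of Python. This is a sketch only; `math.log10` is assumed to stand in for the five-place table, so the last digit of the logarithmic error may differ slightly from the hand computation.

```python
import math

def product(r1, r2):
    """The product [1 - r1^2][1 - r2^2]."""
    return (1 - r1 ** 2) * (1 - r2 ** 2)

central = product(0.856, 0.943)
largest = product(0.8555, 0.9425)   # both r's low by half a unit
smallest = product(0.8565, 0.9435)  # both r's high by half a unit

value_error = largest - central
log_error = math.log10(largest) - math.log10(central)

print(round(smallest, 5), round(central, 5), round(largest, 5))
print(round(value_error, 5), round(log_error, 5))
```

An error of .00035 in a number near .03 is about one part in eighty-five, and it is this relative error that the logarithm registers.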

In case it is desired to have the final answer for a problem 
correct to n significant figures, it is usually best to begin with 
the items correct to n +1 significant figures and use n + 2 place 
tables in the computation. 


* Karl J. Holzinger, Statistical Tables for Students in Education and Psychology.
The University of Chicago Press, 1925. See Table VI for log $(1 - r^2)$.


7. ERRORS IN EDUCATIONAL MEASUREMENT


The errors discussed thus far have been due chiefly to the 
rounding of approximate measurements. They are not peculiar 
to any one field, but occur whenever measurements or observa- 
tions are made and should be taken into account in the sub- 
sequent calculations. Being unbiased in character their effect 
upon the final result may be controlled by care in the arith- 
metical operations as described above. The present section will 
be concerned with errors which occur in the measurement of 
mental characters. 

One difference between mental and physical measurements 
arises from the nature of the scales employed. Arithmetical abil- 
ity, for example, is a very complex character and its resolution 
into component abilities such as those of addition and multipli- 
cation is at best a matter of convenience because each of these 
is a combination of still more specific abilities. A unit of such 
arithmetical ability can therefore never be quite the equivalent 
of another unit in the arithmetical scale in the same way that 
an inch of height is the equivalent of another inch of height. 
Even two problems alike in type and equally difficult for a large 
group may not be equally difficult for a single pupil. The inch, 
on the other hand, has the same significance for the individual 
measurement as in the group. 

This lack of equivalence of test units is closely related to 
another difference between mental and physical scales. The 
complete measurement of a mental trait is probably impossible, 
because the test must always be based on a sampling of the total 
available material. Spelling ability, for example, may be meas- 
ured by a number of well-known scales, but no single test nor 
the combination of several tests will give a complete measure 
of spelling ability. These tests, moreover, will be only roughly 
comparable because different words and methods of testing are 
employed. An approximate transmutation from one mental 
scale to another is always possible, but nothing approaching
the exactness with which inches may be converted into centimeters
can probably ever be attained in the case of mental
measurements. 

The examiner or observer in giving a mental test may intro- 
duce certain errors by his failure to follow the uniform directions 
for the administration of the test. He may create an unfavor- 
able mental attitude on the part of the pupils by hurrying them 
or urging them to be overcautious. In scoring the results he 
may make mistakes in using the key even with objective tests, 
or show poor judgment in rating the specimens in the case of 
product scales such as those for handwriting and composition. 

Another source of error in mental measurement is associated 
with what Professor Pearson has called static as distinct from 
dynamic characters. The former include such physical traits as 
height and weight, the measurement of which is direct and does 
not depend upon the attitude of the person at the time of ex- 
amination. Dynamic characters like lung capacity, strength of 
grip, or intelligence must be measured indirectly by some form 
of reaction, and therefore depend upon the bodily or mental 
fitness of the individual. The measurement of dynamic traits 
thus gives rise to a variability in reaction which may be called 
response error of the person tested. 

It should be noted that when a pupil has been examined 
several times on equally difficult forms of a test, any change in 
his response may be due in part to the attitude of the examiner, 
to imperfections in the test material, to practice effect, to fluctu- 
ations in emotional status and fatigue, etc. Response error as 
measured by variation in score may thus be a combination of 
several of the types of error already discussed. Certain formulas 
which attempt to measure response variability freed from other 
error are presented in Chapter XIII, section 9. 

As pointed out in the second section, the best approximation 
to the true value of any quantity is given by the average of a 
number of observations. For dynamic characters involving response
error this conception of true value may be misleading.
If a person has been tested ten times on as many equivalent
mental scales the average score may be the most typical one, 
but the highest score is likely to be the best representation of his 
true ability because on that performance there were fewer in- 
terfering factors which prevented him from doing himself full 
justice. The same argument might be made with regard to 
characters such as lung capacity. No matter how often the 
test is given the full lung capacity will never be registered, and 
the largest volume obtained may be considered as nearest the 
true result. 

With standardized tests both the average (most typical) and 
the highest (nearest the true) scores will be useful, the former 
giving the best prediction as to future performance, and the 
latter the best indication of potential ability under most favor- 
able conditions. 

The above types of error in calculation and measurement 
may be briefly summarized as follows: 

1. Unbiased or rounding errors to be taken into account in 
calculation. 

2. Biased errors such as those found in teachers’ marks. 

3. Errors of the scale:

a. Non-equivalent units or items;
b. Inadequate sampling of available material.

4. Errors of the examiner:

a. In giving the test;
b. In appraising the results of the test.
5. Response error (or variation) of the examinee. 


EXERCISES 
1. Round off the following numbers to four significant figures: 
35.675002, 846742., 390000., .6744898, .003674378. 


2. If the numbers 39.2 and 18.3 are correct to three significant 
figures, justify the product 717. rather than 700. 


HINT. Use maximum and minimum products. 


3. Justify the quotient 18.3 ÷ 39.2 = 0.467.


4. Show that the sum of 13.26818, 138.36, 78.423, 7238.4289, and
6.324 cannot be as large as 7474.82 or as small as 7474.79. 


5. Show that the product of 34.68 and 4.6, carried to three digits, 
lies between 158 and 161. 


6. Find the probable values of the following: 
a. Sum of 27.848, 182.6, 5478.29, and 5.2777 
b. Difference between 367.19 and 173.4395 
c. Product of 897.5 and 0.08 
d. Product of 37.846 and .0004 
e. Quotient of 37.846 divided by .0004
f. Quotient of .0004 divided by 37.846 


7. Calculate the following products, using Holzinger’s Tables VI 
and VII and a five-place logarithm table of numbers. Repeat the 
calculations, rounding to four-place logarithms throughout, and com- 
pare results. 


ANSWERS 
a. $[1 - (.846)^2][1 - (.931)^2] = .1178$
b. $[1 - (.845)^2][1 - (.674)^2] = .1561$
c. $[1 - (.118)^2][1 - (.981)^2] = .0372$
d. $\sqrt{[1 - (.639)^2][1 - (.846)^2]} = .4101$
e. $\sqrt{[1 - (.550)^2][1 - (.947)^2]} = .2683$
f. $\sqrt{[1 - (.600)^2][1 - (.400)^2]} = .7332$

8. Discuss the theory of “most typical” and “nearest true”
scores given in section 7. Do you agree with the distinction and 
use described by the author? If not, why not? 


9. Can the measurement of mental abilities ever be made as 
exact as the measurement of physical objects? Explain. 


10. Estimate the absolute and relative error made in measuring 
a person’s height with an ordinary yardstick. Estimate the absolute 
and relative error made in measuring a person’s intelligence, by a 
good group test and also by a good individual test. Use any data 
available to assist in these estimates. 


CHAPTER VI 
AVERAGES 
1. INTRODUCTORY 


It has already been shown that the first step in making a long 
series of observations comprehensible is to arrange the data in 
the form of a frequency distribution. This enables one to see 
some of the more outstanding characteristics of the series at a 
glance, and at the same time makes subsequent calculations 
very much easier than they would have been with the data 
ungrouped. 

The hypothetical distributions shown in Fig. 21 reveal cer- 
tain important features by mere inspection. Curves (1) and 
(2) center about the value 15, which is a measure of type or 
average, but the first distribution is spread out more than the 
second. This second characteristic is known as dispersion, or 
variability. Distributions (3) and (4) are said to be skewed, 
the former negatively and the latter positively. Curve (5) is 
very steep (leptokurtic), whereas (6) is flat-topped (platykurtic). 
The first distribution, which is midway between the two, might 
be regarded as mesokurtic. 

All these characteristics are very important in statistical
analysis and they may all be quantitatively determined by
appropriate formulas rather than by inspection of diagrams
as illustrated in Fig. 21. In the present chapter methods will
be presented for the calculation of several important averages, 
which include the mean, median, and mode. Measures of dis- 
persion and skewness will be discussed in Chapter VII. The 
kurtosis of a distribution is so rarely studied that no formulas 
for its measurement are given in this text. Such formulas, however,
may be found in Kelley’s Statistical Method.


2. CALCULATION OF THE MEAN


The most important and generally most reliable average hap- 
pens also to be the best known. This is the arithmetical mean. 
It is defined simply as the sum of the values of the observations 
divided by their number, or by the formula 


$$M = \frac{\Sigma X}{N} \qquad \{\text{Mean}\} \quad (5)$$

where M is used to represent the arithmetical mean, X a value
of the variable, and N the number of items. The symbol $\Sigma$
means “the sum of all quantities as follows,” that is, the sum of
all the X’s. One property of the mean which follows at once
from the above definition is that it is the magnitude each item
would have if all items were the same size.

[Fig. 21. Illustrating variations in central tendency, dispersion,
skewness, and kurtosis. Horizontal scale from 0 to 90.]

The calculation of the mean for ungrouped data is very 
simple. It is only necessary to add the items and divide by their 
number. For long series, however, this process becomes very 
tedious and errors in addition are likely to creep in. Calcula- 
tion from the frequency distribution therefore becomes almost 
imperative with many items. The method will first be illus- 
trated by the use of a short series which has been so selected 
that the attention of the student will first be directed to the 
method rather than to lengthy arithmetic. Needless to say, the 
series is too short for the average to be of any practical value.

Let the mean of the following scores be required: 97, 72, 63,
68, 93, 84, 79, 87, 56, 52, 64, 71, 75, 67, 64. The total of these
items, $\Sigma X$, is 1092 and their mean is 72.8. This is the true mean
within the limits of the accuracy of the data. 

Next, assuming the scores are correct to the nearest unit 
only, we shall arrange them in a frequency distribution as 
follows: 


  CLASS          FREQUENCY

89.5-99.5            2
79.5-89.5            2
69.5-79.5            4
59.5-69.5            5
49.5-59.5            2
                    --
                    15


For purposes of calculation it is assumed that the frequencies 
are concentrated at the mid-points of the respective class in- 
tervals, such points being known as class values (Chapter II, 
section 8). The two top frequencies will thus contribute 
2 x 94.5 = 189.0 to the total instead of 97 + 93 = 190, and so 
on for the other classes, the complete calculation being 


  X        f        fX

 94.5      2      189.0
 84.5      2      169.0
 74.5      4      298.0
 64.5      5      322.5
 54.5      2      109.0
          --     ------
          15     1087.5 = ΣfX

$$M = \frac{\Sigma fX}{N} = \frac{1087.5}{15} = 72.5$$


It is evident that the sums $\Sigma X$ and $\Sigma fX$ differ by 4.5, a
discrepancy which is due to the fact that the frequencies were
taken at class values instead of at observed values. With a 
longer series and more class intervals the above discrepancy 
would be smaller, because the larger number of unbiased errors 
would tend to compensate, and with a narrower interval less 
variation from the class values would be possible. The means, 
it will be noted, differ by only 0.3 in spite of the short series 
and coarse classification. It should also be observed that when 
X represents the same series of values the quantities $\Sigma X$ and
$\Sigma fX$ are algebraically the same, f being merely a symbol of
operation showing that the X’s were added in frequency groups. 
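
The comparison between the observed and the grouped totals may be reproduced as follows. This is a minimal Python sketch, assuming the fifteen scores listed above (with 97 and 93 as the two highest, as the sum 97 + 93 = 190 in the text implies) and classes ten units wide beginning at 49.5; `class_value` is an illustrative helper name.

```python
SCORES = [97, 72, 63, 68, 93, 84, 79, 87, 56, 52, 64, 71, 75, 67, 64]

def class_value(score, lowest_limit=49.5, width=10):
    """Mid-point of the class interval that contains `score`."""
    k = int((score - lowest_limit) // width)
    return lowest_limit + k * width + width / 2

n = len(SCORES)
total_observed = sum(SCORES)                        # sum of X
total_grouped = sum(class_value(s) for s in SCORES)  # sum of fX

mean_observed = total_observed / n
mean_grouped = total_grouped / n

print(total_observed, total_grouped)                  # 1092 1087.5
print(round(mean_observed, 1), round(mean_grouped, 1))  # 72.8 72.5
```

Replacing every score by its class value is exactly what the frequency table does, so the discrepancy of 4.5 is the net displacement of the fifteen scores from their class mid-points.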

The above calculation may be considerably shortened by 
selecting an assumed mean, A, near the middle of the series as 
origin and measuring the variable in units of class intervals. 
We shall take these two steps separately to show their individual 
effect upon shortening the calculation. 


  f      X'      fX'              f     d = X'/h     fd

  2      20       40              2        2          4
  2      10       20              2        1          2
  4       0        0              4        0          0
  5     -10      -50              5       -1         -5
  2     -20      -40              2       -2         -4
 --             ----             --                 ---
 15             -30 = ΣfX'       15                 -3 = Σfd


The $X'$ series, or “reduced series,” has been obtained by
subtracting 74.5 from each of the X’s in the preceding illustration.
In order to obtain the mean from the calculation on the left it is
necessary to add 74.5 to the mean of the $X'$ values since each has
been diminished by that amount, that is,
$M = 74.5 + \frac{-30}{15} = 72.5$.

It will be noted that the $X'$ values are replaced in the work at
the right by d values, which are obtained by dividing the $X'$
values by the width of the class interval, h. In obtaining the mean
of the whole series, therefore, the mean of the d’s, or
$\frac{\Sigma fd}{N}$, must be multiplied by h before being added to
the assumed mean, A. The work will then be
$M = 74.5 + \frac{-3}{15} \times 10 = 72.5$.
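
The two shortening steps can be followed one at a time in code. The sketch below is illustrative only, using the class values and frequencies of the short series with A = 74.5 and h = 10; the variable names are not from the text.

```python
import math

CLASS_VALUES = [94.5, 84.5, 74.5, 64.5, 54.5]
FREQS = [2, 2, 4, 5, 2]
A, h = 74.5, 10          # assumed mean and width of class interval
N = sum(FREQS)

# Step 1: measure from the assumed mean (the "reduced series" X').
reduced = [x - A for x in CLASS_VALUES]
sum_f_reduced = sum(f * xr for f, xr in zip(FREQS, reduced))
mean_step_1 = A + sum_f_reduced / N

# Step 2: measure in units of the class interval (the d values).
steps = [xr / h for xr in reduced]
sum_fd = sum(f * d for f, d in zip(FREQS, steps))
mean_step_2 = A + sum_fd / N * h

# Direct calculation for comparison.
mean_direct = sum(f * x for f, x in zip(FREQS, CLASS_VALUES)) / N

print(sum_f_reduced, sum_fd)                    # -30.0 -3.0
print(mean_step_1, mean_step_2, mean_direct)
```

All three routes give 72.5; the second merely keeps the numbers to be multiplied as small as possible.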


Some students may understand the above method more 
clearly by the following algebraic proof. From the definition of 
X’ we have 


$$X' = X - A,$$
so that
$$X = A + X'.$$
Furthermore,
$$X' = dh.$$
Hence
$$X = A + dh.$$

Summing over this expression (or adding member by member as
many equations of this type as there are cases), we obtain

$$\Sigma X = \Sigma A + \Sigma dh.$$

Dividing by N, factoring out h (which is a constant throughout
the summation), and noting that $\Sigma A = NA$, we obtain the
required formula,

$$\frac{\Sigma X}{N} = M = A + \left(\frac{\Sigma fd}{N}\right)h. \qquad (6)$$

The symbol of operation, f, has been inserted for convenience. 

We shall next take a somewhat longer series in order to review 
the above procedure and note a check on the work. The follow- 
ing scores were made by a class in statistics on the Otis
Self-Administering Test:


TABLE 13. ILLUSTRATING THE CALCULATION OF THE MEAN WITH CHECK

                                                    CHECK
              CLASS         f       d      fd      d'     fd'

            69.5-74.5       6       5      30       4      24
            64.5-69.5       2       4       8       3       6
            59.5-64.5       3       3       9       2       6
            54.5-59.5       6       2      12       1       6
            49.5-54.5      10       1      10       0       0    A = 52
A = 47      44.5-49.5      23       0       0      -1     -23
            39.5-44.5       8      -1      -8      -2     -16
            34.5-39.5       4      -2      -8      -3     -12
            29.5-34.5       4      -3     -12      -4     -16
            24.5-29.5       1      -4      -4      -5      -5

                       N = 67         Σfd = 37        Σfd' = -30

$$M = 47 + \frac{37}{67} \times 5 = 47 + 2.76 = 49.76$$

$$M \text{ (Check)} = 52 - \frac{30}{67} \times 5 = 52 - 2.24 = 49.76$$


In the first of the above calculations for the mean the origin 
is taken at 47, opposite the largest frequency, 23, because it 
looks as if this would furnish a small $\Sigma fd$. The d’s are then
tabulated 1, 2, 3, ... and -1, -2, -3, ... from this point and
the fd products formed. The remainder of the calculation
consists in substituting $\Sigma fd = 37$ in formula (6), where A = 47,
N = 67, and h = 5.

The check on the right is made by selecting a new reference
point or origin and repeating the calculation at least up to the
quantity $\Sigma fd'$. If the new origin differs from the old by one
class unit, $\Sigma fd'$ will differ from $\Sigma fd$ by N. This can be
seen by inspection or shown as follows:

$$d = d' + 1,$$
$$\Sigma fd = \Sigma fd' + \Sigma f,$$
$$\therefore \Sigma fd = \Sigma fd' + N. \qquad \{\text{Check on mean}\} \quad (7)$$

In the above example, $d' = d - 1$ and $\Sigma fd'$ should equal
$\Sigma fd - N$, which it does, checking the work to that stage in the
calculation. The student is warned not to forget to multiply the
quantity $\Sigma fd / N$ by the width of the interval h. Failure to do so
is detected by carrying the check computation through to the 
final result. It is therefore desirable to use the complete check 
until the student is confident of the accuracy of his calculations. 
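
Formula (6) and the check of formula (7) may be sketched in Python for the Otis scores of Table 13. The function names `sum_fd` and `mean_from_origin` are assumptions introduced for illustration; the class values are the mid-points 72, 67, ..., 27 of the intervals in the table.

```python
import math

MID_POINTS = [72, 67, 62, 57, 52, 47, 42, 37, 32, 27]
FREQS = [6, 2, 3, 6, 10, 23, 8, 4, 4, 1]
H = 5  # width of class interval

def sum_fd(origin):
    """Sum of f*d, with d counted in class intervals from `origin`."""
    return sum(f * round((m - origin) / H) for m, f in zip(MID_POINTS, FREQS))

def mean_from_origin(origin):
    """Formula (6): M = A + (sum of fd / N) * h."""
    return origin + sum_fd(origin) / sum(FREQS) * H

N = sum(FREQS)
first, check = sum_fd(47), sum_fd(52)

print(N, first, check)                       # 67 37 -30
print(first - check == N)                    # True, formula (7)
print(round(mean_from_origin(47), 2),
      round(mean_from_origin(52), 2))        # 49.76 49.76
```

Carrying the check through to the final mean, as the last line does, is what exposes a forgotten multiplication by h.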


3. PROPERTIES OF THE MEAN


The arithmetical mean has several important properties 
which should be noted. First of all, it is rigorously defined in 
algebraic terms and is based directly on the actual values of all 
the items. This makes it possible to obtain a definite average 
for any quantitative series, and gives a result which is truly 
characteristic of the whole distribution. 

The algebraic character of the mean makes possible the
combination of averages from several series. Thus, if $X_1$, $X_2$, and
$X_3$ denote the variables in three different groups of size $N_1$,
$N_2$, and $N_3$, the three means will be
$M_1 = \frac{\Sigma X_1}{N_1}$, $M_2 = \frac{\Sigma X_2}{N_2}$, and
$M_3 = \frac{\Sigma X_3}{N_3}$. The mean of all three series is the sum
of all the X’s divided by the total number of items, or

$$M = \frac{\Sigma X_1 + \Sigma X_2 + \Sigma X_3}{N_1 + N_2 + N_3}.$$

This result may be obtained from the individual means by
multiplying each mean by the size of its group and dividing
the sum of these products by the total number of items,
that is,

$$M = \frac{N_1 M_1 + N_2 M_2 + N_3 M_3}{N_1 + N_2 + N_3}.$$

This property might be of great advantage in combining norms
from different localities, for it would be necessary to know only
the means and the number of cases in each group.

It should be noted that the mean of several means is rarely
the average of the items on which the separate means are based.
This can be seen by a very simple example. Let
$M_1 = \frac{6}{2} = 3$, $M_2 = \frac{20}{5} = 4$, and
$M_3 = \frac{15}{3} = 5$. The mean of all the items is $\frac{41}{10}$,
or 4.1, while the mean of the three means is $\frac{3 + 4 + 5}{3}$,
or 4.0. Both results are entirely correct, but represent quite
different things. The reason for the discrepancy may be seen by
noting that if the three means are different,
$\frac{N_1 M_1 + N_2 M_2 + N_3 M_3}{N_1 + N_2 + N_3}$ will equal
$\frac{M_1 + M_2 + M_3}{3}$ only when the three N’s are equal. Since
it is usually the mean of the items which is required, the
averaging of averages should ordinarily
be avoided.
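
The distinction between the mean of all the items and the mean of the means can be illustrated in a few lines of Python. The group sizes 2, 5, and 3 used below are an assumption, chosen so that the three means are 3, 4, and 5 and the grand mean is 4.1 as in the example; `combined_mean` is a hypothetical helper name.

```python
def combined_mean(sizes, means):
    """Mean of all items, from group sizes and group means only."""
    return sum(n * m for n, m in zip(sizes, means)) / sum(sizes)

SIZES = [2, 5, 3]          # assumed N1, N2, N3
MEANS = [3.0, 4.0, 5.0]    # M1, M2, M3

mean_of_items = combined_mean(SIZES, MEANS)
mean_of_means = sum(MEANS) / len(MEANS)

# With equal group sizes the two results coincide.
equal_sizes = combined_mean([4, 4, 4], MEANS)

print(mean_of_items, mean_of_means, equal_sizes)  # 4.1 4.0 4.0
```

The weighted form needs only the means and the numbers of cases, which is the advantage claimed for it in combining norms.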

Another property of the mean appearing from the definition 
is that every item, large or small, contributes its proportionate 
share to the result. This is regarded by some as a defect since 
extreme items appear to have an undue effect upon the mean. 
Against this objection it might be argued that if such extreme 
observations belong in the series, they should be permitted to 
contribute their full share. 

The algebraic properties of the mean are of further impor- 
tance in mathematical statistics where this average enters as 
a parameter in many formulas. It is probably as valuable in 
this respect as any other statistical constant. 

A final, and in some respects the most valuable, property of 
the mean is its stability under fluctuations of sampling for
ordinary distributions. If samples are drawn from a large body
of material and a number of means calculated, they will usually
be closer to the mean of the whole material than if any other 
average had been employed. This property is often character- 
ized as the reliability of the mean (see Chapter XIII). 


4. CALCULATION OF THE MEDIAN 


The median for ungrouped series has already been introduced 
in connection with the classifier described in Chapter II. It is 
the middlemost value of the variable when the observations 
are ranked in order of size, or the magnitude such that greater 
and smaller values occur with equal frequency. For an odd 
number of observations without ties in rank, it is clearly the
magnitude of the middle observation. For an even number of 
cases any value between the two middle items will satisfy the 
above definition, but it is customary to take as the median the 
average of the two middle values. In case there are ties in rank 
near the middle of the series, a weighted average is sometimes 
used as illustrated by the following observations: 1, 3, 5, 9,
10, 12, 12, 12, 14, 14, 15, 16, 17, 21, 23, 25. The value halfway
between 12 and 14 might serve as the median, but the weighted
mean of the middle observations would seem to give a little
more stable result. The median in this example is thus

$$\frac{3 \times 12 + 2 \times 14}{5} = 12.8.$$
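
One reading of this weighted average is that every observation tied with either of the two middle values is included. The Python sketch below follows that reading, which is an interpretation rather than a rule stated in the text, and compares it with the customary halfway value; `tie_weighted_median` is an illustrative name.

```python
import statistics

def tie_weighted_median(values):
    """Median of an even series, averaging all items tied with the middle pair."""
    ordered = sorted(values)
    n = len(ordered)
    if n % 2 == 1:
        return ordered[n // 2]
    low, high = ordered[n // 2 - 1], ordered[n // 2]
    tied = [v for v in ordered if v in (low, high)]
    return sum(tied) / len(tied)

OBSERVATIONS = [1, 3, 5, 9, 10, 12, 12, 12, 14, 14, 15, 16, 17, 21, 23, 25]

weighted = tie_weighted_median(OBSERVATIONS)
halfway = statistics.median(OBSERVATIONS)

print(weighted, halfway)
```

Here the weighted value is 12.8 and the halfway value is 13, the difference arising because 12 occurs three times and 14 only twice.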


When there are a sufficient number of observations to warrant 
the use of a frequency distribution, the above difficulties do not 
arise. The histogram of the Otis scores from Table 13 will illus- 
trate the procedure in this case. Under this representation the 
frequencies are assumed to be spread evenly over the class in- 
tervals, the areas being exactly proportional to the number of 
items between any two class limits. The median is now to be 
regarded as the value of the variable on either side of which half the 
frequencies lie. The graphical solution amounts to determining 
the point on the scale the vertical through which bisects the
area under the histogram. It is thus only necessary to count in
the frequencies from either end and interpolate across the in- 
terval containing the median. 

From Fig. 22 it will be noted that half of the frequencies is
33.5, so that the problem is to determine the point above and
below which 33.5 frequencies lie. Counting up from the lower
end of the scale it is apparent that 17 frequencies lie below 44.5
and 40 frequencies lie below 49.5. The median therefore lies
somewhere between these two values. The difference
33.5 - 17 = 16.5 gives the number of frequencies beyond 44.5
necessary to reach the median. From the rectangles in the diagram
it is apparent that $\frac{x}{5} = \frac{16.5}{23}$, so that the required
distance, x, is $\frac{16.5}{23} \times 5 = 3.6$. The median is therefore
44.5 + 3.6 = 48.1. The work may be checked by counting down from
the upper end of the scale, giving
median $= 49.5 - \frac{(33.5 - 27)}{23} \times 5 = 48.1$.

[Fig. 22. Illustrating the median for the scores on the Otis
Self-Administering Test. Horizontal axis: Otis score, 25 to 75, with
the median (Md.) marked; vertical axis: frequency.]

Using certain abbreviations, we may write two formulas for
calculating the median in the case of the frequency distribution.
The term median interval is used to designate the class interval
which contains the median. Let

u.l. and l.l. = upper and lower limits of median interval,
for example, 49.5 and 44.5 in Fig. 22,

$f_{up}$ and $f_{do}$ = total frequency up to and down to median
interval, for example, 17 and 27,

$f_{md}$ = frequency of median interval, for example, 23,

h = width of class interval, and

Md = median.

The formulas then become

$$\text{Md} = \text{l.l.} + \left(\frac{\frac{N}{2} - f_{up}}{f_{md}}\right)h \qquad \{\text{Median for distribution counting up}\} \quad (8a)$$

and

$$\text{Md} = \text{u.l.} - \left(\frac{\frac{N}{2} - f_{do}}{f_{md}}\right)h \qquad \{\text{Median for distribution counting down}\} \quad (8b)$$

If the student finds it easier to do the calculation by a series 
of steps, the following may be useful:

1. Divide the number of cases by 2, ($\frac{N}{2} = 33.5$).

2. Determine by inspection the interval containing the
median, (44.5-49.5).

3. Count the frequencies up to the median interval, ($f_{up} = 17$).

4. Subtract this last result from $\frac{N}{2}$, (33.5 - 17 = 16.5).

5. Multiply the last result by the width of the interval and
divide by the number of frequencies in the median interval,
($\frac{16.5 \times 5}{23} = 3.6$).

6. Add this quantity to the lower limit of the median interval,
thus obtaining the required median, (44.5 + 3.6 = 48.1).

A similar series of steps may be written out for the calcula- 
tion when counting down from the upper end of the scale. 
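
Both ways of counting can be sketched in a single Python function and applied to the Otis scores of Table 13. This is an illustration under the text's assumption that frequencies are spread evenly over each interval; `grouped_median` and its argument names are hypothetical, and the intervals are listed from the lowest upward.

```python
import math

def grouped_median(lower_limits, freqs, h):
    """Median by counting up (8a) and by counting down (8b)."""
    n_half = sum(freqs) / 2
    below = 0  # frequency up to the current interval
    for lower, f_md in zip(lower_limits, freqs):
        if below + f_md >= n_half:          # median interval found
            above = sum(freqs) - below - f_md
            counted_up = lower + (n_half - below) / f_md * h
            counted_down = (lower + h) - (n_half - above) / f_md * h
            return counted_up, counted_down
        below += f_md
    raise ValueError("empty distribution")

LOWER_LIMITS = [24.5, 29.5, 34.5, 39.5, 44.5, 49.5, 54.5, 59.5, 64.5, 69.5]
FREQS = [1, 4, 4, 8, 23, 10, 6, 3, 2, 6]

up, down = grouped_median(LOWER_LIMITS, FREQS, 5)
print(round(up, 1), round(down, 1))  # 48.1 48.1
```

Agreement of the two values serves the same purpose as the check column in the calculation of the mean.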

It has been noted that the median determines the point on 
the horizontal scale the vertical through which bisects the area 
of the histogram. The mean, on the other hand, is the point at 
which the histogram would balance. It is the center of gravity 
of the distribution. The fd’s correspond to the moments in 
physics (force x distance), and the mean or center of gravity 
occurs where $\Sigma fd = 0$.

The table on page 88 shows the complete calculation of the 
mean and median for a longer series. It will be noted that 
the frequencies are given at central ages 45, 44, etc., or classes 
44.5-45.5, 43.5-44.5, etc., since all ages were tabulated to the
nearest year. The data show the ages at which a group of
college professors listed in Who’s Who received their Ph.D. 
degrees. All these men had an A.B. but none an A.M. degree. 


TABLE 14. ILLUSTRATING THE CALCULATION OF THE MEAN AND MEDIAN

 AGE RECEIVED                                      CHECK
    PH.D.           f        d       fd        d'      fd'

     45             3       16       48        17       51
     44             0       15        0        16        0
     43             3       14       42        15       45
     42             3       13       39        14       42
     41             1       12       12        13       13
     40             5       11       55        12       60
     39             9       10       90        11       99
     38             5        9       45        10       50
     37             5        8       40         9       45
     36             7        7       49         8       56
     35             7        6       42         7       49
     34            10        5       50         6       60
     33            13        4       52         5       65
     32            17        3       51         4       68
     31            29        2       58         3       87
     30            42        1       42         2       84
     29            31        0        0         1       31
     28            27       -1      -27         0        0
     27            37       -2      -74        -1      -37
     26            54       -3     -162        -2     -108
     25            38       -4     -152        -3     -114
     24            29       -5     -145        -4     -116
     23            14       -6      -84        -5      -70
     22             7       -7      -49        -6      -42
     21             2       -8      -16        -7      -14
     20             2       -9      -18        -8      -16

                  400               -12                388

Cases above the median interval (ages 29 to 45): 190.
Cases below the median interval (ages 20 to 27): 183.

$$M = 29 + \frac{-12}{400} \times 1 = 28.97$$

$$M \text{ (Check)} = 28 + \frac{388}{400} \times 1 = 28.97$$

$$\text{Md} = 27.5 + \frac{17}{27} \times 1 = 28.13$$

$$\text{Md} = 28.5 - \frac{10}{27} \times 1 = 28.13$$


5. PROPERTIES OF THE MEDIAN 


The lack of rigor in the definition of the median for undis- 
tributed series has already been noted, and in this respect the 
mean is clearly superior. For large bodies of data, however, in 
which the use of the frequency distribution becomes imperative, 
no difficulties as to rigid definition are likely to arise.* 


* Note that the median for grouped data becomes indeterminate when the
frequency of the median interval is zero. This form of distribution, however, is very rare.

The median is based only indirectly upon all of the observations
inasmuch as it is determined by their relative size.
Whether or not this is an advantage over the mean depends 
upon the particular purpose for which the average is used. 
Under ordinary circumstances all items should contribute fully 
if included at all and the mean is therefore generally superior. 

In combining the averages of several series the mean has a 
great advantage over the median. A simple combination of the 
separate means and totals as shown in section 3 will furnish the 
mean of the entire group of items, while in order to determine 
the grand median it is necessary to combine all of the separate 
distributions into one and calculate from this. As regards other 
algebraic properties the median is again inferior since it cannot 
be employed in connection with the formulas of higher statisti- 
cal analysis. 


The reliability of the median, or its stability under fluctuations
of sampling, is in general less than that of the mean. Only
for very peaked or leptokurtic distributions of the type
illustrated in (5) of Fig. 21 is the median superior in this respect.*

The advantage thus far appears to be entirely in favor of the
mean, but the median has at least two points of superiority. It
is easier to calculate for both long and short series, and in the 
case of ungrouped data the middle item which furnishes the 
median can be uniquely identified and will remain the median 
item under any other form of measurement. Thus the height 
of the eleventh man in a group of twenty-one is typical of all in 
a very real sense, while the mean of the series will very probably 
not correspond to the height of any particular individual. 

For the large bulk of test data the norms, or average scores 
for unselected groups, are given in the form of the medians. In 
using such tests and in making comparisons it is therefore neces- 
sary to use this form of average. For most problems, however, 
the mean is distinctly superior and should be used unless there 
is some very good reason to the contrary. 

* G. U. Yule, Introduction to Statistics, p. 339. C. Griffin & Co., London.

6. THE CRUDE MODE


The modal value of a variable is the value of the most fre- 
quent occurrence. Thus in Fig. 21 the modes are the abscissas 
corresponding to the highest points of the curves. For grouped 
series it is possible to obtain only a crude mode, which may be defined
as the class value of the group with the largest frequency.

The crude mode is obviously unstable inasmuch as it will de- 
pend upon the fineness of classification used in grouping the 
data. By widening or narrowing the class interval, the mode 
may be made to shift very considerably up or down the scale. 
It is therefore to be used only for rough inspectional purposes. 
Its great advantage, of course, lies in the fact that it can be 
determined at a glance. 

The following distribution shows two crude modes for the 
A.B. to A.M. spans of a group of college professors. The spans 
or years elapsing between degrees are again given at class values. 


YEARS BETWEEN A.B. AND A. M. DEGREES f Pi 
sev Coed ein 5 COSI Sea laa alae y 
TEy eee, ee ead eee oe 3} 6 
Ut Meier aeeremamae das testes.» | 22" 
AUR cs tis pe en ae 23 
Ra ie OR ee ee 68 
eh ee ce) | eine ees 238 
(EON eid hs ee ae 152 152 

515 515 


In the first frequency distribution given by f, the crude modes
appear at one and at three years. Grouping by two-year intervals
brings a single mode at two and one-half years. The two
crude modes are of greater practical interest in this example 
because they show that if a graduate student fails to get his 
master’s degree in one year, he will very likely take three years 
instead of two. This is probably due to the fact that he has had
to leave the university for a time upon failure to complete the 
work in a year or that after taking the bachelor’s degree he 
waited a year or two before working on the master’s degree. In 
occasional problems of this sort the crude mode is of interest, 
but in general some other average should be used. 


7. THE GEOMETRIC MEAN AND GEOMETRICAL SERIES 


Geometrical series, which were introduced in Chapter IV, will 
now be considered more generally and applied to some statistical 
problems. The geometric mean of a series of observations is the
value obtained by finding the product of all the observations,
and then obtaining the root of that product with an index equal
to the number of items in the group. Thus the geometric mean
of the values $X_1, X_2, X_3, \ldots, X_N$ may be defined by the relation

$$\text{G.M.} = \sqrt[N]{X_1 X_2 X_3 \cdots X_N} \qquad \{\text{Geometric mean}\} \quad (9)$$

or in terms of logarithms

$$\log(\text{G.M.}) = \frac{1}{N} \sum \log X \qquad \{\text{Logarithmic form of geometric mean}\} \quad (10)$$

the latter form furnishing the usual scheme of calculation.

It will be noted that the geometric mean becomes zero if any
of the X's are zero, and may become imaginary if negative
values occur. As shown in most texts in algebra, the geometric
mean of a series of positive values is never greater than the
arithmetic mean, and is less than it unless all the items are equal.

A geometrical progression has been defined in Chapter IV as a
series of terms such that each term is the product of the preceding
term by a constant factor called the ratio. In the geometrical
series 8, 12, 18, 27, 40.5, this ratio is clearly 1.5, and the
geometric mean of the whole series is

$$\sqrt[5]{8 \times 12 \times 18 \times 27 \times 40.5} = \sqrt[5]{8 \times 8(1.5) \times 8(1.5)^2 \times 8(1.5)^3 \times 8(1.5)^4} = \sqrt[5]{8^5 (1.5)^{10}} = 8 \times (1.5)^2,$$

or G.M. $= 8 \times 2.25 = 18$.
That this result should follow appears at once from the more
general form of the geometrical series $a, ar, ar^2, ar^3, ar^4$, etc.,
the G.M. of the first five terms being

$$\sqrt[5]{a \cdot ar \cdot ar^2 \cdot ar^3 \cdot ar^4} = \sqrt[5]{a^5 r^{10}} = ar^2,$$

or the middle term, which is 18 in the example on page 91.
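The result can be checked numerically. The following is a minimal Python sketch, not part of the original text; the function name `geometric_mean` is an illustrative choice. It applies the logarithmic form (10) to the series 8, 12, 18, 27, 40.5 and compares the result with the middle term $ar^2$ and with the arithmetic mean.

```python
import math

def geometric_mean(values):
    """Geometric mean by the logarithmic form (10): antilog of the mean logarithm."""
    if any(v <= 0 for v in values):
        # Logarithms are undefined for zero and negative items.
        raise ValueError("all items must be positive")
    return math.exp(sum(math.log(v) for v in values) / len(values))

a, r = 8, 1.5
series = [a * r**k for k in range(5)]     # 8, 12, 18, 27, 40.5

gm = geometric_mean(series)
am = sum(series) / len(series)
middle_term = a * r**2

print(round(gm, 2), middle_term, round(am, 2))   # 18.0 18.0 21.1
```

The sketch rejects zero and negative items because the logarithmic form cannot handle them, which corresponds to the caution in the text about zero and negative values.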

The arithmetic mean of the above items is 21.1. In order to
show the relationship between these two means, they have been
plotted with the data in Fig. 23. As a general average of the
five numbers in the series the arithmetic mean is quite adequate,
but as a measure of the average item in such a trend the
geometric mean only is correct. This is further illustrated by
the means of the first four terms. The A.M. is now 16.25 and the
G.M. 14.70, the latter being again a point on the smooth curve
connecting the items in the series.

FIG. 23. Illustrating the geometric and arithmetic means for four and five items


The geometric mean is thus useful in determining averages in 
historical trends where the items form something like a geo- 
metrical series. The cost data in Table 15 furnish an example 
of this sort. 

From the nature of a geometric series it is apparent that the 
ratio of each term to the one just preceding is equal to the con- 
stant ratio r. Applying this test to the cost data, a series of 
fairly equal ratios is found, showing that the original items 
form approximately a geometrical progression. In determin- 
ing the expenditure at any point in this trend, the geometric 
mean should therefore be used. The total yearly costs have 
been taken at the mid-year points, so that if the cost from the 
TABLE 15. COST DATA ILLUSTRATING THE USE OF THE GEOMETRIC MEAN


EXPENDITURE 
FOR PUBLIC RATIO OF 
YEAR SCHOOLS IN THE EacH ITEM es nde 
UNITED STATES | TO ONE ABOVE 
(IN MILLIONS) 
TE oes ae 2800 
LOO omepacn icant e tls ae. sess < 1.047 246.9 
1903-190 Lae eae eae ee 252.8 1.061 2651 
USOASL9O barren et spe eel 273.2 1.081 284 8 
EIO5-CO0G oar wees eee, al ene 29156 1.067 305.8 
TIO LOO Terr. coe See ese 307.8 1.056 328.5 
LOOT 190 Sirrety ced aha SY Aer 336.9 1.095 352.8 
1 908—US 09M eters ce ee ee 371.3 1.102 378.9 
UN RD) ese co JSS on onee 401.4 1.081 Wee 
TOTO LOT Te era a ete sche hc 426.3 1.062 437.0 
DOM 10 12 ree ees) eee ae 446.7 1.048 4693 
OT 19d ae eek ans, We or 482.9 1.081 504.1 
TOS eS a ee am Be bn 521.5 1.080 BALA 
LTA LOL Gwar toe oben oe a 555.1 1.064 581.4 
OTD 19 1 Gite a cera en ese es 605.5 1.091 624.5 
POLG 191 7m sere es ccs 3 Naess 3-16 640.7 1.058 670.7 
UGE DSS 0 eee! See esd gs ae 702.2 1.096 7203 
EO SLO: 9 Meme Se we oye vs 763.7 1.088 773.6 
A.M.= 435.9 A.M.=1.0740 
G.M.= 406.9 G.M.=1.0739 


middle of 1909 to the middle of 1910 were required it could be
approximated by finding the geometric mean of 401.4 and 426.3,
or √171,116.82, which is 413.7, or, since we are figuring in
millions of dollars, $413,700,000.

For the entire series the A.M. is 435.9 and the G.M. 406.9,
while for the set of accompanying ratios the arithmetic and 
geometric means agree at 1.074. The correct method of averag- 
ing such ratios is by the geometric mean, but in the above ex- 
ample there is very close agreement between the two averages 
because of the even nature of the series. 
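The interpolation and the ratio test described above can be sketched in a few lines of Python. This is an illustrative check only, not part of the original text; it uses three consecutive expenditure figures that are legible in Table 15, and the variable names are assumptions.

```python
import math

# Two adjacent yearly totals (millions of dollars) from Table 15.
cost_a, cost_b = 401.4, 426.3

# Geometric mean of the pair: the estimated cost for the year
# lying halfway between the two mid-year points.
midpoint_cost = math.sqrt(cost_a * cost_b)

# Ratio of each item to the one above it, for three consecutive rows.
costs = [401.4, 426.3, 446.7]
ratios = [later / earlier for earlier, later in zip(costs, costs[1:])]

print(round(midpoint_cost, 1))          # 413.7
print([round(r, 3) for r in ratios])    # [1.062, 1.048]
```

The two ratios agree with the corresponding entries in the ratio column of the table.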

The theoretical series given in Table 15 was obtained by form- 
ing a geometric progression with a = 406.9 and r = 1.074, and 
extending it in both directions from the beginning of the year 
1910. It may be noted that an error in the fourth place of 
these numbers may occur because of the cumulative nature of
the errors in the powers of r. 

From the very good agreement between the observed and
theoretical cost trends it is further apparent that the expenditure
data form a good approximation to a geometrical series.
From the years 1901 to 1918 the cost rose an average of
7.4 per cent each year over the one just preceding. The character
of this increase is the same as that of a sum of money out
at compound interest.

FIG. 24. Plot of the cost data and theoretical series from Table 15
(vertical axis: millions of dollars; horizontal axis: year, 1901 to 1918)

The close agreement noted above might tempt one to extend 
the curve and predict future costs. Thus the expenditure in 
1922 might have been forecasted as $(406.9)(1.074)^{12.5}$. This
gives 993.2 as compared with an actual cost expenditure of 
1,580.7 millions. The large discrepancy is due chiefly to the in- 
fluence of post-war conditions upon the purchasing power of 
money. In making the prediction it was assumed that the same 
factors influencing expenditures from 1901 to 1918 would con- 
tinue to operate in 1922. This assumption, as we have seen, is 
not valid. 
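The forecast above is an ordinary compound-growth extrapolation. A short Python sketch, offered only as an arithmetic check and assuming the 12.5 years run from the beginning of 1910 to the middle of 1922, reproduces the figure and shows how far the actual cost departed from it.

```python
base_cost = 406.9      # theoretical cost (millions) at the beginning of 1910
ratio = 1.074          # average yearly ratio of the series
years_ahead = 12.5     # beginning of 1910 to the middle of 1922 (assumed)

forecast = base_cost * ratio ** years_ahead
actual = 1580.7        # actual expenditure quoted in the text

print(round(forecast, 1))            # 993.2
print(round(actual - forecast, 1))   # shortfall of the forecast, in millions
```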
8. THE HARMONIC MEAN


The harmonic mean of a series of observations is the reciprocal
of the arithmetic mean of their reciprocals, or if H (or H.M.) be
the harmonic mean,

$$\frac{1}{H} = \frac{1}{N} \sum \left(\frac{1}{X}\right) \qquad \{\text{Harmonic mean}\} \quad (11)$$

For the series 8, 12, 18, 27, 40.5 the harmonic mean will thus be
given by

$$\frac{1}{H} = \frac{1}{5}\left(\frac{1}{8} + \frac{1}{12} + \frac{1}{18} + \frac{1}{27} + \frac{1}{40.5}\right) = .0651.$$

Therefore H = 15.4.


The work can be done very readily using a table of reciprocals. 
The three averages for the above data may now be written

H.M. = 15.4,
G.M. = 18,
A.M. = 21.1,


which is the order of magnitude always found as shown by 
texts in algebra. 
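The three averages can be recomputed directly from their definitions. The sketch below is an illustrative Python check, not part of the original text; it assumes Python 3.8 or later for `math.prod`.

```python
import math

series = [8, 12, 18, 27, 40.5]
n = len(series)

harmonic = n / sum(1 / x for x in series)     # formula (11)
geometric = math.prod(series) ** (1 / n)      # formula (9)
arithmetic = sum(series) / n

print(round(harmonic, 1), round(geometric, 1), round(arithmetic, 1))  # 15.4 18.0 21.1
```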

The harmonic mean may be illustrated by a supposititious 
problem. Let us assume that five pupils worked an hour on 
some problems, with the results set down in two forms as follows: 


PROBLEMS WORKED IN AN HOUR     MINUTES REQUIRED TO WORK A PROBLEM

            10                                  6
             8                                  7.5
             6                                 10
             4                                 15
             2                                 30

        M_r = 6                            M_t = 13.7
        H_r = 4.38                         H_t = 10


If r and $M_r$ denote the rate and arithmetic mean rate at which
the problems were worked, and t and $H_t$ represent the time and
harmonic mean time in minutes required to work a problem, it
is evident that

$$r = \frac{60}{t},$$

$$\frac{\sum r}{N} = 60 \cdot \frac{1}{N} \sum \frac{1}{t},$$

and

$$M_r = 60 \times \frac{1}{H_t}, \qquad \left(6 = 60 \times \tfrac{1}{10}\right).$$

Similarly,

$$M_t = 60 \times \frac{1}{H_r}, \qquad \left(13.7 = 60 \times \tfrac{1}{4.38}\right).$$

The arithmetic mean rates and mean times may therefore be 
obtained by determining the reciprocals of the harmonic mean 
times and rates and multiplying by the proper constant. 

Two experimenters, for instance, might have recorded their 
results, one as rate and one in time, but by using the above 
relationships their averages could be made directly comparable. 
Thus in the above example the mean rate, 6, may be found 
by the arithmetic mean of the rates or by dividing 60 by the 
harmonic mean of the corresponding times. 

In general, if the test results are recorded as rates, $M_r$ should
be employed; but if times are recorded, $M_t$ should be used.
The corresponding harmonic means are chiefly useful in making 
results comparable when necessary. 
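The relation between mean rates and harmonic mean times can be verified on the five pupils of the example. The following Python sketch is illustrative only; the helper names `arithmetic_mean` and `harmonic_mean` are assumptions and not the author's notation.

```python
import math

def arithmetic_mean(values):
    return sum(values) / len(values)

def harmonic_mean(values):
    # Reciprocal of the arithmetic mean of the reciprocals, formula (11).
    return len(values) / sum(1 / v for v in values)

rates = [10, 8, 6, 4, 2]              # problems worked in an hour
times = [60 / r for r in rates]       # minutes required to work a problem

mean_rate = arithmetic_mean(rates)    # M_r
mean_time = arithmetic_mean(times)    # M_t
h_rate = harmonic_mean(rates)         # H_r
h_time = harmonic_mean(times)         # H_t

# M_r = 60 / H_t and M_t = 60 / H_r, up to rounding error.
print(math.isclose(mean_rate, 60 / h_time), math.isclose(mean_time, 60 / h_rate))
print(round(mean_time, 1), round(h_rate, 2), round(h_time, 1))
```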


EXERCISES 


1. Calculate the mean and median for each of the following
frequency distributions: 


(1) 


(2) 


(3) 


CLASS if CLASS if CLAss VALUE fi 
94,5-99.5 al 36.5-388.5 1 10 1 
89.5-94.5 2 34.5-36.5 - 9 2 
84.5-89.5 3 32.5-34.5 3 8 5 
79.5-84.5 5 30.5-32.5 4 7 10 
74.5-79.5 7 28.5-30.5 10 6 12 
69.5-74.5 6 26.5-28.5 4 5 10 
64.5-69.5 4 24.5-26.5 3 4 4 
59.5-64.5 = 22.5-24.5 2 3 3 
54.5-59.5 1 20.5-22.5 2 2 =; 

1 2 
Mi teoe M = 28.81 M = 5.86 
Md = 77.00 Md = 29.20 Md = 5.96 
(4) (5) (6)
CLASS fi CLASS VALUE fd CLASS fi 
89.5-99.5 1 95 1 40.5-43.5 1 
79.5-89.5 i, 85 ~ 87.5-40.5 2 
69.5-79.5 5 75 3 34.5-37.5 D 
59.5-69.5 20 65 — 31.5-34.5 6 
49.5-59.5 16 55 5 28.5-31.5 7 
39.5-49.5 4 45 6 25.5-28.5 10 
29.5-39.5 5 35 ff 22.5-25.5 4 
19.5-29.5 2 25 = 19.5-22.5 3 
9.5-19.5 1 15 4 16.5-19.5 2 

5 7 
M = 57.36 M = 42.14 M = 29.325 
Md = 59.50 Md = 41.67 Md = 28.93 
(7) (8) (9) 

CLASS f CLASS f CLASS VALUE f 
84.95-39.95 1 10.25-11.25 2 1D 1 
29-99734.95 3 9.25-10.25 2 10.5 2 
24.95-29.95 7 8.25-9.25 4 9.5 4 
19.95-24.95 10 7.25-8.25 th 8.5 5 
14.95-19.95 4 6.25-7.25 8 (Qs) 6 

9.95-14.95 2 §.25-6.25 4 6.5 @ 
4.95-9.95 2 4.25-5.25 or 5.5 4 
3.25-4.25 1 4.5 2 
M = 22.79 M = 7.35 VE T56 
Md = 28.20 Md = 7.25 IG) = 


2. Calculate the means and medians for the data of Exercise 1, 
Chapter II, using class intervals of 10 for the Otis and Terman tests, 
and an interval of 5 units for the Chicago test. 


OTIS CHICAGO TERMAN 
NCAT MEAMP tet aie We foals: es bea ve 139.3 doul5 124.5 
Mediatiwer emir 6 ada SS 6 140.6 53.64 124.5 
Ans. 


3. Verify the means and medians for the distributions of the Army 
Alpha Test given on pages 98 and 99. The class values were taken 
at 207.5, 202.5, etc., which makes the averages .5 larger than they 
would have been if the intervals had been given as 204.5-209.5, etc.
VARIABLES: ALPHA SCORE × SCHOOLING. GROUP I, II, III: WHITE DRAFT
(NATIVE BORN)*


For men who took alpha only. 


GRADES HIGH SCHOOL 
ALPHA SCORE ; 
ip 8 1 2 3 4 
205-2095. -se. nee ew cee = = 1 = = 
200-204), Se cae s & = = = = = = 
195-199 eee = = 1 S 1 = 
190-194 . Ste - = = = 2 3 
185-189 . = 1 2 3 2 9 
180-184 . - 2 2 6 4 15 
175-179 . 2 5 4 6 4 15 
170-174 . 2 5 6 10 11 22 
165-169 . 3 8 8 7 10 38 
160-164 . - 12 18 12 10 48 
155-159 . 2 15 22 20 24 56 
150-154 . 7 29 36 30 29 63 
145-149 . uf 48 27 42 34 96 
PAQ=UA4 a teas cen ff 62 44 41 41 98 
TBH —LSOC Ne. eg ta oars 15 76 55 57 46 106 
HSO=ESE YS Bate ie 48 19 108 73 85 69 130 
2512 Ores eo, ton site 17 159 86 89 62 120 
WAS Ye re eo 24 164 92 94 74 121 
1151 LOVEE meng couse 36 249 113 129 80 148 
DUO-PE4 re ser cot ue es 52 309 136 126 91 151 
LOS-LO9wy Was at oe 66 384 173 146 83 140 
LOO=104 Fara eA: oe 97 430 168 174 97 135 
O5=GOM Fy ne taw ths 141 523 199 148 95 135 
DUEOSr) ac 3 Suet 170 624 209 174 97 105 
S689 Seno! Goat ae 187 661 230 201 107 110 
80284" ee i ss 247 756 232 167 84 103 
B19’ nm ch ee eae 326 811 248 137 82 87 
LOS (A ey boat ates. Grose 378 914 238 165 78 81 
65=69) (5 net “Seaes. 385 957 225 131 Be 70 
GO=64 i ee kee 499 989 246 146 64 57 
SSO sti ooh tes 594 1,057 178 114 38 83 
DOSG4) & caw Fim 611 996 161 95 21 32 
i5= 49) tee ee 650 937 143 vs 24 18 
CA a: 1: ee 660 845 107 55 24 21 
S5=S97 BG Masako: Osh ke 638 706 88 49 27 14 
SOR84 ere See a oe 636 642 80 36 13 16 
PASSAT AE Meant aly 511 461 45 31 12 11 
AV 2 eee Ge ara 380 281 27 12 8 7 
SEC eae eee ness 231 189 10 16 4 8 
LO-1A Tre Ai he ee 54 59 1 ff 4 = 
FM Wc ee Reg Nie See 44 34 1 2 1 1 
O24 ee hen ieee: 3 = 10 1 2 ~ a 
Total seth cugee see 7,701 14,518 3,736 2,838 1,614 2,423 
ica 

YY RUMOR Te ac ta te ee hae 53.874 68.287 83.842 90.366 98.823 109.881 
MGS att Sct Po 50.356 65.277 81.487 89.502 98.263 110.911 


* Data from Memoirs of National Academy of Sciences, Vol. XV, p. 748.
VARIABLES: ALPHA SCORE × SCHOOLING. GROUP I, II, III: WHITE DRAFT
(NATIVE BORN)* (CONTINUED)


For men who took alpha only. 


COLLEGE 
ALPHA SCORE 
il Ps 3 4 5 6 

BU0=200 sem So, a x = = - - — - 
200-204 . = - i 74 - - 
195-199 . 2 1 i. 5 1 1 
P9O-194 oe e's ih 2 2 5 - ib 
PS5=18 955s a sce 4 4A 4 i 2 ak 
180-184 . 2 3 2 19 1 - 
UG —-LTO me as ee aw on 9 8 6 23 2 1 
A ORUTA Pre. ue seo) ss 9 13 12 36 5 1 
66-1698 Gl cess coyi 16 14 14 38 i 2 
LOOSEGA Re ects ote 25 18 aby 44 4 3 
MOOS UO OMersn es Ga es a 22 29 28 39 2 al 
DOSLb4 SW See ss 29 34 31 AT 5 at 
WAG LEO Ve ee oars 33 31 28 53 6 2 
140-144 Sn ne tes 26 26 18 Al 3 - 
SOUS OU eect, org 8 51 37 21 29 3 2 
TSO USA re cist isy ees 42 50 27 29 i 1 
PASI Ss Gh Gao 62 38 29 35 4 1 
202A ee ee Ee 48 43 20 34 2 it 
EAS wt Sees 44 58 28 42 1 - 
| i ee hf OS ey 65 48 28 28 = 1 
USSS RU eae 52 36 23 21 6 74 
B00 —104T tem po veils) 6 44 33 22 ile 1 - 
95 OO ete kherew sc ys 58 25 24 26 2 

D0 OLN tn teas, 47 51 25 ie - - 
GREGE 6 6 fe & bau 6 55 35 21 10 - = 
SO- SA See ef toe Al 37 17 11 aq! = 
Tas reece eres. @ 39 29) 9 6 3 2 
OT res Bes aes esos 45 22 12 4 i 2 
G5 =O9R eeeecrnh ay a 29 29 13 13 = - 
GOO A memo nes ica 41 iyi 8 2 - = 
55-59 38 16 11 2 = = 
50-54 16 12 6 3 = - 
(Ee es. a Cee ota 23 11 8 3 a - 
A TAIT aD lope ess 2 “ae 11 4 3 - 1 - 
35-39 17 3 1 1 - - 
80-34 3 2 2 2 - - 
25-29 yA 3 = 2 - - 
20-24 1 4 1 - - - 
15-19 4 = = = = = 
10-14 - = - i = - 
5-9 = = ~ = = = 
0-4 = = = = = = 
“To tallfmerceasane, Towson: 1,056 829 523 707 60 26 
Vim Tce tse. sas ek 5 105.559 112.168 118.877 136.581 134.417 140.0 
IVE tains sins See eL06;046 114.427 119.911 141.890 143.333 147.5 

Ans, 


* Data from Memoirs of National Academy of Sciences, Vol. XV, p. 748. 
4. Calculate the geometric mean for the following cost data:

          TOTAL EXPENDITURE FOR SCHOOLS IN
YEAR      UNITED STATES RELATIVE TO 1914
1914                 100
1916                 115
1918                 138
1920                 187
1922                 285
                (G.M. = 153)

Taking a = 100 and $ar^4 = 285$, compute r and construct a geometrical
series of five terms (r = 1.30). Compare with the data. Should you
conclude that the cost increased in geometrical progression during
this period?


CHAPTER VII 


MEASURES OF DISPERSION 


1. INTRODUCTORY 


The dispersion of a series of observations is the degree of
scatter, or the extent to which the items are spread out along
the scale from some average value. It is important to have
measures of such variability for several reasons, one of them
arising from its relation to the reliability of the average.

In Fig. 25 two distributions with the same number of cases
and the same average are shown. In the case of curve (a) the
observations cluster closely around the mean, while in curve (b)
they are spread out much more along the scale. It is therefore
apparent that the average of the first distribution is more
representative of the whole series, more typical of all the
observations, and for the same number of items to be regarded
as the more reliable. In comparing two or more averages it is
necessary to have some measure of their respective reliabilities,
and for this purpose a numerical representation of the dispersion
of the series is first required. Appropriate reliability formulas
for averages and other statistical quantities will be found in
Chapter XIII.

FIG. 25. Illustrating difference in dispersion for two series with the same mean

Another need for a measure of dispersion arises when a general
measure of homogeneity is required. In teaching, for example, 
it is well known that when the students in a class differ widely 
in previous training and mental characteristics, instruction be- 
comes a very difficult problem. The measurement of the abili- 
ties involved and the quantitative appraisal of the variation in 
different groups make it possible to approach such a problem 
scientifically. Closely related to this problem is the question 
whether or not uniform instruction tends to bring a class up to 
a common level of attainment, or brings about a still further 
differentiation in ability. These questions can be answered best 
by making use of some measure of dispersion.

Other uses of group variability appear in connection with 
problems in the overlapping of pupil abilities and in the stand- 
ardization of tests, as a parameter in higher statistical analysis, 
and as a common unit of measure in comparing performances 
on unlike scales. This last use will be discussed in section 7. 


2. MEAN DEVIATION 


One of the simplest measures of dispersion is the mean devia- 
tion or variation of the observations about some central tendency 
such as the arithmetic mean or median. The computation will
first be illustrated by a short ungrouped series. If x denotes the 
variation of an observation X from the mean M, then x = X— M 
and the original values and deviations for five scores may be 
set down as follows: 


    X          x
   21       + 4.4
   19       + 2.4
   17       + 0.4
   14       − 2.6
   12       − 4.6
M = 16.6    14.4 = Σ|x|


It will be observed that the algebraic sum of the deviations 
x is zero in the above problem. This may be generally shown 
by noting that if $X' = X - A$, then $\frac{\sum X'}{N} = M - A$, so that the
sum of the deviations X' vanishes when M = A. In securing
a measure of average variation it is therefore necessary to eliminate
the algebraic signs in some way. The mean deviation is
secured by adding the absolute values of the deviations (disregarding
sign) and dividing by their number, or in symbols,

$$\text{M.D.} = \frac{\sum |x|}{N} \qquad \{\text{Mean deviation}\} \quad (12)$$

In the illustrative example,

$$\text{M.D.} = \frac{14.4}{5} = 2.88.$$

This simple process becomes lengthy if the mean and deviations
are written to several decimal places, and for this reason a
shorter method will next be introduced. The procedure is
illustrated by Fig. 26. The above five scores are represented by
the horizontal bars, and the deviations from the mean by the
hatched and dotted portions. Since the total negative deviation
is equal to the total positive deviation, the deviation for the
entire series may be obtained by determining the total negative
deviation and multiplying the result by 2. Furthermore, the
negative deviation may be found by subtracting from the sum of
the segments AC and DF the sum of the original observations
represented by AB and DE.

FIG. 26. Illustrating deviations from the mean for five scores

The complete procedure may then be described as follows: 

1. Arrange the observations in order of size. 

2. Compute the mean, (16.6). 

3. Count the items smaller than the mean and multiply their
number by the mean, (2 × 16.6 = 33.2).

4. Subtract from this result the sum of the items smaller
than the mean, (33.2 — 26= 7.2). This is the total negative 
deviation. 

5. Multiply this last result by two, (2 x 7.2=14.4), and 
divide by N to determine M.D., (14.4/5 = 2.88). 

The work through step 4 may be checked by adding the
items larger than the mean and subtracting from this sum the 
product of the mean by the number of greater items. 
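The five steps and the check can be followed in a short Python sketch. It is an illustration under the assumptions stated in the comments, not the author's own scheme of computation, and it compares the short method with the direct definition (12).

```python
import math

scores = sorted([21, 19, 17, 14, 12])        # step 1: arrange in order of size
n = len(scores)
mean = sum(scores) / n                       # step 2: 16.6

# Direct definition (12): mean of the absolute deviations.
md_direct = sum(abs(x - mean) for x in scores) / n

# Steps 3 and 4: total negative deviation from the items below the mean.
smaller = [x for x in scores if x < mean]
total_negative = len(smaller) * mean - sum(smaller)

# Step 5: double it and divide by N.
md_short = 2 * total_negative / n

# Check: total positive deviation from the items above the mean.
larger = [x for x in scores if x > mean]
total_positive = sum(larger) - len(larger) * mean

print(round(md_direct, 2), round(md_short, 2))                  # 2.88 2.88
print(round(total_negative, 1), round(total_positive, 1))       # 7.2 7.2
```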

It is evident graphically that if the same quantity is added to 
or subtracted from each item the deviations remain unchanged. 
This may also be shown algebraically. If A is the quantity 
subtracted we may write 


$$X' = X - A,$$

$$\frac{\sum X'}{N} = \frac{\sum X}{N} - A,$$

so that

$$M' = M - A,$$

and

$$x' = X' - M' = X - M = x.$$

This simplification of the items is occasionally useful in 
further shortening the calculation. The whole procedure is 
illustrated by another series as shown in Table 16. 


TABLE 16. SHOWING THE CALCULATION OF MEAN DEVIATION
FOR AN UNGROUPED SERIES

   X      X' = X − 110
  123         13
  120         10
  119          9       Sum of items larger than M' = 48
  119          9
  117          7
  114          4
  112          2
  112          2       6 items smaller than M'
  111          1       Sum of items smaller than M' = 9
  110          0
  110          0
           5.182 = M'

    6 × 5.182 = 31.09
                 9.
                22.09
                  × 2
                44.18
    M.D. = 44.18/11 = 4.02

    Check:      48
    5 × 5.182 = 25.91
                22.09
In determining mean deviation it is theoretically better to
take the deviations from the median instead of from the mean 
because, as can be readily demonstrated, the total variation is 
less about the median. The above short method could not have 
been used, however, if the median had been employed, since 
the sum of the deviations about this average is not zero. Further- 
more, for longer series it makes very little difference numeri- 
cally which average is selected. The mean may, therefore, be 
used in ordinary practice. 

In the case of the frequency distribution the same method 
may be used as for ungrouped items, the values of the observa- 
tions being taken at the mid-points of the intervals, that is, at 
class values. The work is illustrated with the following problem: 


CLASS        f      d      fd
90-100-      1      3       3       8 × 63 = 504
80-90-       2      2       4                370
70-80-       4      1       4                134
60-70-       5      0       0                × 2
50-60-       4     −1      −4                268
40-50-       3     −2      −6       M.D. = 268/20 = 13.4
30-40-       0     −3       0
20-30-       0     −4       0       Check:   890
10-20-       1     −5      −5       12 × 63 = 756
        N = 20        Σfd = −4               134


The class values in this example are 15, 25, 35, etc., so that
the mean is 65 − (4/20) × 10, or 63, by formula (6). There are 8
frequencies below 63, and 12 above, since the 5 in the interval
60-70- comes at 65. The product of 8 and 63 gives a result
equal to the sum of the items smaller than the mean plus their
deviations from the mean. Next, the sum of the products of
the smaller class values by their corresponding frequencies is
55 × 4 + 45 × 3 + 15 × 1 = 370. Subtracting this last result from
504 furnishes the total negative deviation 134. Multiplying this
result by 2 to obtain the total positive and negative deviation,
and dividing by 20 gives M.D. = 13.4.
By making use of certain abbreviations, a formula for mean
deviation may now be set up. Let

    $A_m$ = the class value of the interval in which M lies,
    $N_a$ and $N_b$ = the number of observations above and below M,
    $T_a$ and $T_b$ = the sums of the observations above and below M,
    $\sum |fd|_a$ and $\sum |fd|_b$ = absolute values of the parts of $\sum fd$ above and below $A_m$, and
    h = the width of the class interval.

The steps in the calculation on page 105 may now be combined
so as to give the checking formula

$$\text{M.D.} = \frac{2(T_a - N_a M)}{N} = \frac{2(N_b M - T_b)}{N},$$

or

$$\text{M.D.} = \frac{T_a - T_b - M(N_a - N_b)}{N}. \qquad (13)$$

It then remains to find $T_a$ and $T_b$ for the frequency distribution.
These are clearly given by

$$T_a = N_a A_m + \left(\sum |fd|_a\right) h$$

and

$$T_b = N_b A_m - \left(\sum |fd|_b\right) h.$$

Substituting these values in equation (13) and noting that
$\sum |fd|_a + \sum |fd|_b = \sum |fd|$, we have

$$\text{M.D.} = \frac{\left(\sum |fd|\right) h + (A_m - M)(N_a - N_b)}{N} \qquad \{\text{Mean deviation for frequency distribution}\} \quad (14)$$

which is the desired result.

Applying formula (14) to the problem on page 105, we find that

$$\text{M.D.} = \frac{26 \times 10 + (65 - 63)(12 - 8)}{20} = \frac{268}{20} = 13.4, \text{ as before.}$$
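Formula (14) can be compared with the direct definition on the same frequency distribution. The Python sketch below is illustrative; the function names are assumptions, and the class value $A_m$ is located as the class value nearest the mean, which is valid when the mean lies inside that interval.

```python
import math

def mean_deviation_direct(class_values, freqs):
    """Mean of the absolute deviations of the class values from the mean."""
    n = sum(freqs)
    mean = sum(f * x for x, f in zip(class_values, freqs)) / n
    return sum(f * abs(x - mean) for x, f in zip(class_values, freqs)) / n

def mean_deviation_formula_14(class_values, freqs, h):
    """Formula (14), with A_m taken in the interval containing the mean."""
    n = sum(freqs)
    mean = sum(f * x for x, f in zip(class_values, freqs)) / n
    a_m = min(class_values, key=lambda x: abs(x - mean))
    sum_abs_fd = sum(f * abs(x - a_m) / h for x, f in zip(class_values, freqs))
    n_above = sum(f for x, f in zip(class_values, freqs) if x > mean)
    n_below = sum(f for x, f in zip(class_values, freqs) if x < mean)
    return (sum_abs_fd * h + (a_m - mean) * (n_above - n_below)) / n

class_values = [95, 85, 75, 65, 55, 45, 35, 25, 15]
freqs        = [1, 2, 4, 5, 4, 3, 0, 0, 1]

md_direct = mean_deviation_direct(class_values, freqs)
md_formula = mean_deviation_formula_14(class_values, freqs, h=10)
print(round(md_direct, 1), round(md_formula, 1))   # 13.4 13.4
```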




In order to fix the method of calculation and to warn the 
student of the difficulty which arises when A is not taken in the 
interval in which M lies, another model problem is next given 
with complete computations. 


TABLE 17. ILLUSTRATING THE CORRECT AND INCORRECT CALCULATION
OF MEAN DEVIATION FOR A DISTRIBUTION USING FORMULA (14)

                     CORRECT METHOD       INCORRECT METHOD
CLASS VALUE     f      d       fd           d'      fd'
   97.5        22      3       66            5      110
   92.5        68      2      136            4      272
   87.5        51      1       51            3      153
   82.5        28      0        0            2       56
   77.5        47     −1      −47            1       47
   72.5        33     −2      −66            0        0
   67.5        21     −3      −63           −1      −21
   62.5         9     −4      −36           −2      −18
   57.5         6     −5      −30           −3      −18
   52.5         2     −6      −12           −4       −8
   47.5         1     −7       −7           −5       −5
   42.5         1     −8       −8           −6       −6
            N = 289         Σfd = −16            Σfd' = 562
      N_a − N_b = 49      Σ|fd| = 522          Σ|fd'| = 714

Correct method:
    M = 82.5 − (16/289) × 5 = 82.223,    A_m − M = .277
    M.D. = [522 × 5 + (.277)(49)] / 289 = 2623.573 / 289 = 9.078

Incorrect method:
    M.D. = [714 × 5 − 9.723 × 49] / 289 = 10.70

The work on the left is correct, while that on the right with 
origin at 72.5 is quite wrong. In case it is found that A does not 
lie in the interval containing the mean, this should be adjusted 
at once, using the previous results as a check on the mean. The 
reason for the incorrectness of the method on the right may be 
shown by noting that the expressions for $T_a$ and $T_b$ on page
106 give incorrect results in this case. The complete proof is
left as an exercise for the student. 
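The effect of a badly placed origin can be seen by evaluating formula (14) with both origins of Table 17 and comparing each with the direct definition. This Python sketch is illustrative only; the function name `mean_deviation_with_origin` is an assumption.

```python
def mean_deviation_with_origin(class_values, freqs, h, origin):
    """Formula (14) evaluated with an arbitrary origin A in place of A_m."""
    n = sum(freqs)
    d = [(x - origin) / h for x in class_values]
    sum_fd = sum(f * di for f, di in zip(freqs, d))
    sum_abs_fd = sum(f * abs(di) for f, di in zip(freqs, d))
    mean = origin + h * sum_fd / n
    n_above = sum(f for x, f in zip(class_values, freqs) if x > mean)
    n_below = sum(f for x, f in zip(class_values, freqs) if x < mean)
    return (sum_abs_fd * h + (origin - mean) * (n_above - n_below)) / n

class_values = [97.5, 92.5, 87.5, 82.5, 77.5, 72.5,
                67.5, 62.5, 57.5, 52.5, 47.5, 42.5]
freqs = [22, 68, 51, 28, 47, 33, 21, 9, 6, 2, 1, 1]

n = sum(freqs)
mean = sum(f * x for x, f in zip(class_values, freqs)) / n
direct = sum(f * abs(x - mean) for x, f in zip(class_values, freqs)) / n

correct = mean_deviation_with_origin(class_values, freqs, 5, origin=82.5)
wrong = mean_deviation_with_origin(class_values, freqs, 5, origin=72.5)

print(round(direct, 3), round(correct, 3), round(wrong, 2))   # 9.078 9.078 10.7
```

Only the origin lying in the interval that contains the mean reproduces the direct result.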
3. THE STANDARD DEVIATION


In order to introduce the next measure of dispersion we may 
return to the short series shown at the beginning of the preced- 
ing section. A measure of average deviation was there found 
by adding the deviates from the mean regardless of sign. By 
the present method the algebraic signs are eliminated by squar- 
ing the deviations from the mean. 


    X        x        x²
   21     + 4.4    19.36
   19     + 2.4     5.76
   17     + 0.4      .16
   14     − 2.6     6.76
   12     − 4.6    21.16
M = 16.6        Σx² = 53.20

$$\text{S.D.} = \sqrt{\frac{\sum x^2}{N}} = \sqrt{\frac{53.20}{5}} = \sqrt{10.64} = 3.26$$


The quantity $\frac{\sum x^2}{N}$ might now be used as a measure of mean
square dispersion, but it has been found much more convenient
and theoretically desirable to take the square root of this average.
The standard deviation is therefore defined as

$$\text{S.D.} = \sqrt{\frac{\sum x^2}{N}} \qquad \{\text{Standard deviation, original form}\} \quad (15)$$
The method of calculation for ungrouped series is comparatively
simple, but in order to obviate the squaring of decimals
a short cut is usually employed.
It has been shown in section 2 that

$$x = X - M = X' - M'.$$

Therefore

$$x^2 = (X')^2 - 2X'M' + (M')^2$$

and

$$\sum x^2 = \sum (X')^2 - 2M'\left(\sum X'\right) + N(M')^2.$$

But since $NM' = \sum X'$, we may write

$$\frac{\sum x^2}{N} = \frac{\sum (X')^2}{N} - (M')^2,$$

or

$$\text{S.D.} = \sqrt{\frac{\sum (X')^2}{N} - (M')^2} \qquad \{\text{Standard deviation for reduced series}\} \quad (16)$$
Applying this formula to the above problem we have

   X     X' = X − 12    (X')²
   21         9           81
   19         7           49
   17         5           25
   14         2            4
   12         0            0
     M' = 4.6      Σ(X')² = 159

$$\text{S.D.} = \sqrt{\frac{159}{5} - 21.16} = \sqrt{10.64} = 3.26,$$

as before.
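That formulas (15) and (16) agree can be confirmed on the five scores. The following Python sketch is an illustrative check, not part of the original text; the origin 12 is the one used in the worked example.

```python
import math

scores = [21, 19, 17, 14, 12]
n = len(scores)

# Original form (15): root of the mean squared deviation from the mean.
mean = sum(scores) / n
sd_direct = math.sqrt(sum((x - mean) ** 2 for x in scores) / n)

# Reduced series (16): subtract a convenient origin first.
origin = 12
reduced = [x - origin for x in scores]            # X' = 9, 7, 5, 2, 0
mean_reduced = sum(reduced) / n                   # M' = 4.6
sd_reduced = math.sqrt(sum(v * v for v in reduced) / n - mean_reduced ** 2)

print(round(sd_direct, 2), round(sd_reduced, 2))  # 3.26 3.26
```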


For the frequency distribution the same method is employed.
Since $X' = dh$ and $M' = (\sum fd)h/N$ (see Chapter VI, section 2)
the formula becomes

$$\text{S.D.} = \left[\sqrt{\frac{\sum fd^2}{N} - \left(\frac{\sum fd}{N}\right)^2}\,\right] h \qquad \{\text{Standard deviation for frequency distribution}\} \quad (17)$$

the calculation being carried through to the last step in class
units when the result is then multiplied by the width of the
class interval h.

The work will be illustrated by the Otis test data from Table
13. It is necessary to calculate only one column of items in
addition to the computation for the mean.

TABLE 18. ILLUSTRATING THE COMPUTATION OF STANDARD DEVIATION
FOR A DISTRIBUTION WITH CHECK

CLASS INTERVAL     f     d     fd    fd²     d'    fd'   f(d')²
69.5-74.5          6     5     30    150      4     24     96
64.5-69.5          2     4      8     32      3      6     18
59.5-64.5          3     3      9     27      2      6     12
54.5-59.5          6     2     12     24      1      6      6
49.5-54.5         10     1     10     10      0      0      0
44.5-49.5         23     0      0      0     −1    −23     23
39.5-44.5          8    −1     −8      8     −2    −16     32
34.5-39.5          4    −2     −8     16     −3    −12     36
29.5-34.5          4    −3    −12     36     −4    −16     64
24.5-29.5          1    −4     −4     16     −5     −5     25
                  67           37    319           −30    312
                   N          Σfd   Σfd²          Σfd'  Σf(d')²

$$\sigma^{*} = \left[\sqrt{\frac{319}{67} - \left(\frac{37}{67}\right)^2}\,\right] \times 5 = \left[\sqrt{4.7612 - .3049}\,\right] \times 5 = 2.11 \times 5 = 10.55.$$

$$\text{Check: } \sigma = \left[\sqrt{\frac{312}{67} - \left(\frac{-30}{67}\right)^2}\,\right] \times 5 = \left[\sqrt{4.6567 - .2005}\,\right] \times 5 = 2.11 \times 5 = 10.55.$$

* The standard deviation is frequently symbolized by the small Greek letter σ.

The quantities
$fd^2$ are obtained by multiplying each value of d by the corresponding
fd products. These may be checked by multiplying
f by $d^2$. Thus, 5 × 30 = 150, 4 × 8 = 32, etc., or 6 × 25 = 150,
2 × 16 = 32, etc.

A more complete check may be made by choosing a new origin
as in the calculation for the mean. If

$$d = d' + 1,$$

then

$$d^2 = (d')^2 + 2d' + 1,$$

and

$$\sum fd^2 = \sum f(d')^2 + 2\sum fd' + N. \qquad \{\text{Check on standard deviation}\} \quad (18)$$


In the above problem $d = d' + 1$, so that $\sum fd^2$ should equal
$\sum f(d')^2 + 2\sum fd' + N$. Since 319 = 312 + 2(−30) + 67, the work
is checked to this stage in the calculation. The remainder of 
the computation consists in substituting the appropriate values 
in formula (17) as shown in the work under the model problem 
in Table 18. It will be noted that it is desirable to carry the 
work under the radical to four decimal places if the answer be 
required to two. 
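The computation of Table 18 and the check of formula (18) can be reproduced as follows. This is a minimal Python sketch for illustration; it works entirely in class units and multiplies by the interval width only at the last step, as formula (17) directs.

```python
import math

freqs = [6, 2, 3, 6, 10, 23, 8, 4, 4, 1]
d = [5, 4, 3, 2, 1, 0, -1, -2, -3, -4]     # deviations in class units
h = 5                                      # width of the class interval

n = sum(freqs)
sum_fd = sum(f * di for f, di in zip(freqs, d))
sum_fd2 = sum(f * di * di for f, di in zip(freqs, d))

# Formula (17): class units throughout, then multiply by h.
sigma = h * math.sqrt(sum_fd2 / n - (sum_fd / n) ** 2)

# Check (18): shift the origin one class upward, so that d = d' + 1.
d_shift = [di - 1 for di in d]
sum_fd_shift = sum(f * di for f, di in zip(freqs, d_shift))
sum_fd2_shift = sum(f * di * di for f, di in zip(freqs, d_shift))
check_holds = sum_fd2 == sum_fd2_shift + 2 * sum_fd_shift + n

print(n, sum_fd, sum_fd2, sum_fd_shift, sum_fd2_shift)   # 67 37 319 -30 312
print(round(sigma, 2), check_holds)                      # 10.55 True
```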

Before comparing the above two measures of dispersion and 
noting their uses, another measure of variability will be intro- 
duced. This is known as the semi-inter-quartile range, or more 
briefly, as the quartile deviation. 


4. THE QUARTILE DEVIATION 


This measure of variability is defined as half the range of 
the middle 50 per cent of the observations when arranged in 
order of size or in a frequency distribution. It is only necessary 
to determine two values, Q1 and Q3, below and above which one
quarter of the measures lie. The range Q1 to Q3 then includes
the middle half of the observations and the semi-inter-quartile
range is defined by the expression


\[ Q = \frac{Q_3 - Q_1}{2}. \]   {Quartile deviation}   (19)

In the case of ungrouped material the work may often be done
by inspection as shown in the accompanying table of total state
and local per capita expenditures in southern states. Maryland
is the median state with an expenditure of $6.11 per capita.


TABLE 19. PER CAPITA EXPENDITURES FOR EDUCATION IN SEVENTEEN
SOUTHERN STATES


                          PER CAPITA EXPENDITURE FOR
STATE                         EDUCATION IN 1900

Oklahoma                        $11.94
District of Columbia             10.68
Delaware                          9.02
West Virginia                     8.75
                                             Q3 = $8.58
Texas                             8.41
Florida                        [illegible]
Louisiana                         6.65
Virginia                          6.61
Maryland                          6.11       Q2 = Md = $6.11
North Carolina                    5.44
Tennessee                         4.96
South Carolina                    4.63
Arkansas                          4.62
                                             Q1 = $4.59
Alabama                           4.55
Georgia                           4.55
Mississippi                       4.54
Kentucky                          4.36


\[ Q = \frac{\$8.58 - \$4.59}{2} = \frac{\$3.99}{2} = \$2.00. \]


The value for Q3 is taken halfway between the expenditures
for Texas and West Virginia, or at $8.58, and similarly for Q1,
which is $4.59. Q is then half the difference between these two
results, or $2.00. It will be noted that nine cases lie between Q1
and Q3, and that this is more than half of the total number
of items, which is seventeen. For so few items, however, it is
hardly worth while to strive for a more accurate result, the pur-
pose of the table being to furnish only rough comparisons.
The differences Q3 - Q2 = $2.47 and Q2 - Q1 = $1.52 are not
equal to Q, because of the lack of symmetry in the series, but
their sum is of course equal to 2 Q. With this limitation in mind
the value of Q may be said to furnish approximately the mag-
nitude which, when laid off on both sides of the median, will
include the middle half of the items.

When the data are in a distribution, the values for Q1 and Q3
are computed in the same way as the median, the only difference
being that one quarter instead of one half of the observations 
are counted in from either end. The procedure may be illus- 
trated for the following distribution of intelligence quotients. 
These data are taken from a survey made in several counties in 
Illinois, the results of the study being as yet unpublished. 


TABLE 20. ILLUSTRATING THE COMPUTATION OF QUARTILE DEVIATION
FOR A DISTRIBUTION


I.Q.            f

150-160         2
140-150        12
130-140        36
120-130       103
110-120       318      (110 to 160: f_do = 471)
100-110       799      = f3
 90-100      1074
 80-90       1059
 70-80        868      = f1
 60-70        366
 50-60     [illegible]
 40-50     [illegible]
 30-40          9      (30 to 70: f_up = 563)

Total        4834

N/4 = 1208.5

\[ Q_3 = 110 - \frac{1208.5 - 471}{799} \times 10 = 100.770 \]
\[ Q_1 = 70 + \frac{1208.5 - 563}{868} \times 10 = 77.437 \]

Q3 - Q1 = 23.333
Q = 11.67


Q3 and Q1 may be computed most readily from formulas sim-
ilar to those used for the median. If one quarter of the cases
be counted in from either end of the distribution the formulas
become


\[ Q_3 = \text{u.l.} - \frac{\dfrac{N}{4} - f_{do}}{f_3} \times h \]   (20a)   {Quartiles}
\[ Q_1 = \text{l.l.} + \frac{\dfrac{N}{4} - f_{up}}{f_1} \times h \]   (20b)

where f1 and f3 are the frequencies of the intervals where Q1 and
Q3 occur and the other symbols are used as in the formulas for
the median. The calculation is shown in full at the right of the
distribution. A check may be made by counting in three quar-
ters of the way from either end of the distribution and using
similar formulas.
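
The quartile computation for the distribution of intelligence quotients may be sketched as follows. The sketch takes the interval limits, the frequencies f1 and f3, and the cumulated frequencies 563 and 471 directly from the worked calculation beside Table 20; the function names are hypothetical.

```python
def quartile_from_below(lower_limit, n, f_up, f_class, h):
    """Q1: count one quarter of the cases in from the low end."""
    return lower_limit + (n / 4 - f_up) / f_class * h

def quartile_from_above(upper_limit, n, f_do, f_class, h):
    """Q3: count one quarter of the cases in from the high end."""
    return upper_limit - (n / 4 - f_do) / f_class * h

N, h = 4834, 10
q1 = quartile_from_below(70, N, f_up=563, f_class=868, h=h)
q3 = quartile_from_above(110, N, f_do=471, f_class=799, h=h)
q = (q3 - q1) / 2   # quartile deviation

print(round(q1, 3), round(q3, 3), round(q, 2))
```

Only the two intervals containing the quartiles and the frequencies cumulated up to them are needed, so the remaining classes do not enter the calculation.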


5. COMPARISON OF MEASURES OF DISPERSION 


In order to bring together the quantitative methods discussed 
thus far, all the simple averages and measures of dispersion 
have been computed for the above distribution and located 
graphically on a histogram. The student should work out and 
verify the following results: 


Mean       = 89.28        M.D. = 13.65
Median     = 89.31        S.D. = 16.86
Crude mode = 95.00        Q    = 11.67


The close agreement of the mean and median would seem to 
indicate a high degree of symmetry in the distribution, but con- 
trary to expectation the data do not furnish a good example of a 
normal probability curve as will be shown in Chapter XIII. 

As illustrated by Fig. 27 a range of 2 Q includes the middle 
50 per cent of the observations, a range of 2 M.D. (from the 
mean) somewhat more than half of the cases, while a range 
of 2 S.D. includes about two thirds of the items. Furthermore, 
the ratio of Q to S.D. is approximately .69, while the ratio of 
M.D. to S.D. is .81. These are typical of the results found with 
fairly symmetrical distributions. For the normal probability 
curve these two ratios are .6745 and .7979 respectively (Chap- 
ter XII). 
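
As a rough check on these ratios, the theoretical values for the normal curve can be obtained from Python's standard library and set beside the observed ones. In this sketch the observed Q, M.D., and S.D. are taken as 11.67, 13.65, and 16.86; the last two are the values implied by the ranges quoted in the text and should be treated as assumptions.

```python
from math import pi, sqrt
from statistics import NormalDist

normal = NormalDist()                      # unit normal curve

q_ratio_normal = normal.inv_cdf(0.75)      # Q / S.D. for the normal curve
md_ratio_normal = sqrt(2 / pi)             # M.D. / S.D. for the normal curve

# Observed measures for the I.Q. distribution (assumed values, see above).
Q, MD, SD = 11.67, 13.65, 16.86
q_ratio_observed = Q / SD
md_ratio_observed = MD / SD

# Share of a normal distribution within one S.D. and within one M.D. of the mean.
within_sd = normal.cdf(1) - normal.cdf(-1)
within_md = normal.cdf(md_ratio_normal) - normal.cdf(-md_ratio_normal)

print(round(q_ratio_normal, 4), round(md_ratio_normal, 4))
print(round(q_ratio_observed, 2), round(md_ratio_observed, 2))
print(round(within_sd, 3), round(within_md, 3))
```

The last line bears out the statements that a range of 2 S.D. covers about two thirds of the cases and a range of 2 M.D. somewhat more than half.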

By laying off the standard deviation three times to the left 
and to the right of the mean, a range of 6 S.D. from 38.70 to 
139.86 is obtained. By referring to Table 20 for the frequencies 
it will be noted that about 4811 cases, or 99 per cent of all 
the observations, lie within this range. For distributions of this
type, then, deviations greater than 3 S.D. from the mean occur
very infrequently. Similarly, a range of 7.5 M.D. will extend from
38.09 to 140.47, while a range of 9 Q (laid off from the median)
runs from 36.80 to 141.82. Within all three of the above ranges,
therefore, more than 99 per cent of the cases will ordinarily occur.


FIG. 27. Illustrating the comparative magnitude of several measures
of dispersion


As regards clear definition there is little choice between the 
three measures of dispersion when the data are arranged in a 
frequency distribution. For undistributed series, however, the 
quartile deviation has the same defects as the median. As illus-
trated in Table 19, it is sometimes necessary to take the average
of two neighboring values for Q1 or Q3.

The algebraic properties of the standard deviation make it
the most useful for combining the results from several series,
and in connection with other statistical formulas. Thus, if two
series of size N1 and N2 have a total population of N = N1 + N2,
with means M1 and M2, and standard deviations σ1 and σ2, it is
possible from these values to find the mean M, and the standard
deviation σ, of the whole group.

It has already been shown in Chapter VI that

\[ M = \frac{N_1 M_1 + N_2 M_2}{N}. \]

The standard deviation of the total series may also be found.
From the proof in section 3 it is apparent that if the assumed
mean A be taken equal to M for both series,
\[ M_1 - M = C_1 \quad \text{and} \quad M_2 - M = C_2. \]
The mean square variations of the component series about M are,
by equation (16),
\[ \frac{\Sigma (X'_1)^2}{N_1} = \sigma_1^2 + C_1^2 \quad \text{and} \quad \frac{\Sigma (X'_2)^2}{N_2} = \sigma_2^2 + C_2^2 \]
respectively. The total square variation, or Σx², of both groups
about M is therefore
\[ \Sigma (X'_1)^2 + \Sigma (X'_2)^2 = N_1(\sigma_1^2 + C_1^2) + N_2(\sigma_2^2 + C_2^2), \]
or
\[ N\sigma^2 = N_1(\sigma_1^2 + C_1^2) + N_2(\sigma_2^2 + C_2^2), \]   (21)
and in case the component series are of equal size we may write

\[ \sigma^2 = \tfrac{1}{2}(\sigma_1^2 + \sigma_2^2) + \tfrac{1}{2}(C_1^2 + C_2^2), \]

the second term vanishing when the two means coincide.
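
Formula (21) can be tried on a small made-up example. The sketch below combines two short series (invented purely for illustration) from their sizes, means, and standard deviations, and compares the result with the mean and standard deviation computed directly from the pooled scores; `combine` is a hypothetical helper name.

```python
from math import sqrt
from statistics import fmean, pstdev

def combine(n1, m1, s1, n2, m2, s2):
    """Mean and S.D. of the whole group from the two component series."""
    n = n1 + n2
    m = (n1 * m1 + n2 * m2) / n
    c1, c2 = m1 - m, m2 - m                      # distances of each mean from M
    variance = (n1 * (s1**2 + c1**2) + n2 * (s2**2 + c2**2)) / n
    return n, m, sqrt(variance)

# Two invented series of unequal size.
a = [2, 4, 4, 6]
b = [7, 9, 11, 13, 15, 17]

n, m, s = combine(len(a), fmean(a), pstdev(a), len(b), fmean(b), pstdev(b))

# Direct computation on the pooled scores for comparison.
m_direct = fmean(a + b)
s_direct = pstdev(a + b)
print(n, round(m, 4), round(s, 4), round(m_direct, 4), round(s_direct, 4))
```

`pstdev` divides by N, which matches the definition of the standard deviation used in this chapter.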


The quartile deviation is probably the easiest measure of 
variability to compute, the mean deviation next, and the stand- 
ard deviation most laborious of all. Simplicity of calculation, 
however, should rarely determine which measure of dispersion 
to employ since other properties are much more important. 

The standard deviation is, in general, less affected by fluctua-
tions in sampling than Q or M.D., and for this reason alone is
preferable to the others. It is sometimes argued that the pres-
ence of a few extremely large or small observations may affect
the standard deviation unduly, but if such items are truly a
part of the distribution this objection is overruled.

In view of all of the above properties the standard deviation 
is the best measure of variability to employ for the fairly sym- 
metrical distributions ordinarily found with educational or 
psychological data. A fairly safe rule with such material is to 
use the mean and the standard deviation whenever the data 
warrant careful treatment, reserving the median and Q for 
rough work with small samples. 


6. THE COEFFICIENT OF VARIATION 


The measures of variability discussed thus far have two 
properties that are at once apparent. 

1. They are expressed in the units of the variable so that 
direct comparisons of dispersion can be made only between 
series on the same scale. 

2. They depend upon the size of the deviations from some 
central tendency, but are quite independent of the magnitude 
of the average itself.

A measure of variability which is independent of the scale
units and which takes into account the size of the deviations
relative to the mean may be expressed in the form
\[ \sqrt{\frac{\Sigma (x/M)^2}{N}}, \]
which reduces at once to σ/M. Professor Pearson has called this
quantity (when multiplied by 100 for convenience) the coefficient
of variation, or percentage ratio of the standard deviation to the
arithmetic mean. Denoting this new measure of variation by
V, we have

\[ V = \frac{100\,\sigma}{M}. \]   {Coefficient of variation}   (22)


It should be noted that while σ is the standard deviation of
X, V is the standard deviation of 100 X/M. The student who
has difficulty in visualizing the significance of the coefficient of
variation may thus regard it as the dispersion found when all of
the observations (or deviations from the mean) have been made
comparable by dividing each by M/100.

Direct comparisons of measures of absolute variability such 
as standard deviation, and relative variability as given by the 
coefficient of variation, often lead to confusion. Both are root 
mean square measures of variability, but of quite different 
things as shown above. 

A simple example may illustrate this point: 


M1 = 20 problems, σ1 = 4 problems, ∴ V1 = 20
M2 = 40 problems, σ2 = 4 problems, ∴ V2 = 10


These two series are equally variable as to absolute disper- 
sion, but the relative variability in the first group is twice that 
in the second. Both measures are entirely correct, although it 
has been argued by Franzen * that the coefficient of variation 
should not be used with such material because of the arbitrary 
nature of the zero point on educational tests and scales. This 
amounts to objecting to the coefficient of variation because the 
size of the mean is arbitrary, but on the same grounds we should 
object to the use of the mean itself.

The chief use of the coefficient of variation is in comparing the
dispersion of series where the means differ considerably in size and
where the variation relative to the mean is therefore important.

The following distributions (p. 118) give the per capita state 
and local expenditures of forty-nine states (including the Dis- 
trict of Columbia) for elementary and secondary education, 
and for all purposes in 1920. 

TABLE 21. PER CAPITA EXPENDITURE FOR EDUCATION IN 49 STATES *

EXPENDITURE           f
$21-$22.99            1
$19-$20.99            2
$17-$18.99            3
$15-$16.99            6
$13-$14.99            4
$11-$12.99            6
$9-$10.99             9
$7-$8.99              7
$5-$6.99              3
$3-$4.99              8
Total                49

M = $10.94
σ = $4.87
V = 44.5


TABLE 22. PER CAPITA EXPENDITURE FOR ALL PURPOSES IN 49 STATES *

EXPENDITURE           f
$100-$109.99          1
$90-$99.99            0
$80-$89.99            1
$70-$79.99            1
$60-$69.99            8
$50-$59.99            7
$40-$49.99            6
$30-$39.99           12
$20-$29.99            6
$10-$19.99            7
Total                49

M = $43.16
σ = $20.07
V = 46.5


If the standard deviation had been employed in comparing
the variability of these two groups, it would have appeared that
there is much more uniformity among the states in educational
expenditure than in total expenditures. Using the coefficient
of variation, however, we find little difference in relative dis-
persion. While both results are correct for some purposes, the
latter gives the better measure of the relative extent to which
these two types of expenditure have become stabilized. A varia-
tion of a dollar in the first group is comparable with a variation
not of one but of about four dollars in the second series. For
such problems the relative rather than the absolute dispersion
should be used to show the degree of uniformity in expenditure.


* Raymond Franzen, "Statistical Issues," Journal of Educational Psychology,
September, 1924, p. 381.
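
The coefficient of variation for the two expenditure series, and for the simple two-group example given earlier, may be worked out as in the following sketch. It rests on several assumptions: the frequencies are those of Tables 21 and 22 as reconstructed from the printed totals and means, class values are taken at the round midpoints 4, 6, ... and 15, 25, ..., and the function names are hypothetical.

```python
from math import sqrt

def coefficient_of_variation(sd, mean):
    """V = 100 * sigma / M, formula (22)."""
    return 100 * sd / mean

def grouped_stats(midpoints, freqs):
    """Mean, S.D., and V for a frequency distribution given class midpoints."""
    n = sum(freqs)
    mean = sum(f * x for x, f in zip(midpoints, freqs)) / n
    variance = sum(f * (x - mean) ** 2 for x, f in zip(midpoints, freqs)) / n
    sd = sqrt(variance)
    return mean, sd, coefficient_of_variation(sd, mean)

# Simple example: equal absolute dispersion, unequal relative dispersion.
v1 = coefficient_of_variation(4, 20)
v2 = coefficient_of_variation(4, 40)

# Table 21 (education), lowest class first; midpoints assumed at 4, 6, ..., 22.
edu_mid = list(range(4, 23, 2))
edu_f = [8, 3, 7, 9, 6, 4, 6, 3, 2, 1]

# Table 22 (all purposes), lowest class first; midpoints assumed at 15, ..., 105.
all_mid = list(range(15, 106, 10))
all_f = [7, 6, 12, 6, 7, 8, 1, 1, 0, 1]

edu_mean, edu_sd, edu_v = grouped_stats(edu_mid, edu_f)
all_mean, all_sd, all_v = grouped_stats(all_mid, all_f)

print(v1, v2)
print(round(edu_mean, 2), round(edu_sd, 2), round(edu_v, 1))
print(round(all_mean, 2), round(all_sd, 2), round(all_v, 1))
```

The standard deviations differ by a factor of about four, while the two values of V lie within a few points of each other.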


7. COMPARABLE MEASUREMENTS 


One of the most important uses of variability is in furnishing 
units for the comparison of measurements on unlike scales. Be- 
cause of its algebraic nature, the standard deviation is the most 
useful for this purpose. The standard scores on tests X1, X2, X3, ...
may then be defined as the deviations from the mean divided by
the respective standard deviations, or x1/σ1, x2/σ2, x3/σ3, .... Like
the coefficient of variation these scores are clearly abstract
numbers, since they result from dividing a denominate number by a
quantity of the same denomination.


* Adapted from Miss Newcomer's figures in "Financial Statistics of Public
Education in the United States, 1910-1920." The Macmillan Company, 1924.

It will be noted that the standard score of a pupil gives his 
relative position in the group in terms of a number of standard 
deviations above or below the mean. Thus, if the raw score
be 120 with M = 90 and σ = 10, the standard score will be
(120 - 90)/10 = +3. If on another test this pupil scores 18 with
M = 12 and σ = 2, his standard score will again be +3. His
relative position in the distributions of both tests is clearly the
same as shown by the standard scores.

Being abstract numbers, standard scores on several tests may
be combined by addition. The only caution that needs to be
observed is that the various distributions from which the original
scores are taken for comparison shall be of the same general
shape. For a very skewed distribution an observation one S.D.
above the mean of the series is not comparable with a meas-
urement one S.D. above the mean of a symmetrical group.

In order to illustrate the use of standard scores the following
data resulting from seven different tests are presented:


TABLE 23. STANDARD SCORES OF A PUPIL ON SEVERAL TESTS


TEST      MEAN     S.D.    SCORES OF    x = X - M *     x/σ
                           A PUPIL
1         163      10.2      179           +16         +1.57
2         119       8.1      128            +9         +1.11
3          24       6.0       28            +4         +0.67
4         264      39.8      312           +48         +1.21
5          74       8.2       89           +15         +1.83
6           7.3     2.1        6            -1.3       -0.62
7         133      16.4      151           +18         +1.10
Total                        893                        6.87
Mean                         127.6                      0.98


* The deviations x = X - M are first computed and then each is divided by σ as
shown in the last column.

If a composite score of the seven tests is desired, it would
not appear correct to add the scores on the separate tests, be- 
cause they are in unlike units and undue weight would be given 
to extreme scores. The objection that unlike quantities should 
not be added is not a serious one because even horses, pigs, cows, 
and sheep may be added together to secure the total number of 
farmyard animals. This amounts, of course, to broadening the 
unit so as to include all items in the sub-classes of the total 
group. The objection against the extreme weighting of some 
scores may be more important, for a score of 6 on one test may 
represent a mental effort as serious as a score of 179 on another 
scale. Both of the above difficulties are overcome when stand- 
ard scores are used, the only trouble being the amount of arith- 
metic involved and the presence of positive and negative scores. 
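
The standard scores of Table 23 can be reproduced with a short Python sketch. The means, standard deviations, and raw scores below are those read from the table (the mean and S.D. of the sixth test are inferred from its printed deviation and standard score, and should be treated as an assumption).

```python
# (mean, S.D., pupil's raw score) for each of the seven tests of Table 23.
tests = [
    (163, 10.2, 179),
    (119, 8.1, 128),
    (24, 6.0, 28),
    (264, 39.8, 312),
    (74, 8.2, 89),
    (7.3, 2.1, 6),      # mean and S.D. inferred from x = -1.3 and x/sigma = -0.62
    (133, 16.4, 151),
]

# Standard score = deviation from the mean divided by the standard deviation.
z = [round((score - mean) / sd, 2) for mean, sd, score in tests]

raw_total = sum(score for _, _, score in tests)   # composite of raw scores
z_total = round(sum(z), 2)                        # composite of standard scores
z_mean = round(z_total / len(z), 2)

print(z)
print(raw_total, z_total, z_mean)
```

In the raw total the fourth test alone contributes 312 of the 893 points, whereas in the standard-score total every test is on the same footing.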

For very careful work the best method for comparing meas- 
urements and forming composites is through the use of standard 
scores. Test scores are far from stable, however, and great pre- 
cision in their treatment is not always desirable or necessary. 
In many composite tests the components may be added in the 
unweighted form with practically as good results as by the 
standard score or other methods of weighting. This is illus- 
trated by the Terman Group Intelligence Test consisting of ten 
parts. The simple total of all points made was found to agree 
(correlate) almost perfectly with the composite formed by 
weighting each of the separate tests and adding them. There is 
considerable disagreement, of course, in the case of some scores, 
but when fifty to one hundred cases are taken these individual 
differences have little effect upon the net result, especially when 
the number of test items is fairly large and they are not ex- 
tremely uneven in weighted value. 

Aside from the question of precision it may be important to 
represent scores in the standard form for the purpose of clearer 
interpretation. By a very simple formula based on standard 
scores it is possible to transmute the results on any number of 
tests so that they all have the same mean and standard devia-
tion. This method, which is quite old,* is frequently rediscov-
ered and appears from time to time in a slightly different form 
in psychological and educational journals. 

Let X1 and X2 represent the scores on two tests expressed
in any units, M1 and M2 the respective means, and σ1 and σ2 the
corresponding standard deviations. We may now write

\[ \frac{x_1}{\sigma_1} = \frac{x_2}{\sigma_2}, \]
or
\[ x_1 = \frac{\sigma_1}{\sigma_2}\,x_2. \]

Since x = X - M, this may also be expressed in the form

\[ X_1 = M_1 + \frac{\sigma_1}{\sigma_2}(X_2 - M_2). \]   {Transmutation formula for comparable scores, score form}   (23)

This is the desired transformation which, when applied to X2,
makes its mean and standard deviation equal to those of X1.
These properties are apparent from the preceding equation.
Hence by applying this formula to each item in the series we 
may, without affecting the relative position of any value, change 
the series so that it will have any mean and standard deviation 
desired. 

As an example we may select M; = 50 and o; = 10, these 
being convenient numbers. By the application of the above 
transformation to any number of tests, they may be brought 
into direct comparison with the one selected as standard. Thus 
the series of X scores shown below may be transmuted into
comparable T scores† by the relation

\[ T = 50 + \frac{10}{1.155}(X - 3), \]
or
\[ T = 24.02 + 8.66\,X. \]


  X      f                          T        f
  5     10                        67.32     10
  4     20      M_x = 3           58.66     20
  3     30                        50.00     30
  2     20      σ_x = 1.155       41.34     20
  1     10                        32.68     10

        90                                  90


* Galton introduced comparable measures in the form of deviations from the
median divided by the semi-inter-quartile range.

† These are similar to McCall's T-Scores. See William McCall, How to Measure
in Education. The Macmillan Company, 1922. See also Chapter XII, section 8, of
the present text.


The distribution of T scores obviously has a mean of 50 and 
a standard deviation of 10. By similar transformations any 
number of series will have these same properties, so that the 
scores on all tests may be brought into direct comparison. 
Thus a score of 50 will always represent the performance of an 
individual at the mean, while 30 will represent the score of a 
person two standard deviations below the mean, etc. If such 
a scaling method were adopted it should, of course, be ap- 
plied only to large groups of unselected children at different 
ages or grades. After the T scores have been worked out for 
the different tests, transmutation tables should be prepared so 
that any X score can be easily transformed into the correspond- 
ing T score. 
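
The transmutation of the small series above into T scores can be sketched as follows. The target mean of 50 and standard deviation of 10 follow the text; the function name `transmute` and its parameter names are hypothetical.

```python
from math import sqrt

def transmute(x, m_from, sd_from, m_to=50.0, sd_to=10.0):
    """Re-express a score so the series takes the mean and S.D. desired."""
    return m_to + sd_to / sd_from * (x - m_from)

# The distribution of X scores given in the text: score -> frequency.
scores = {5: 10, 4: 20, 3: 30, 2: 20, 1: 10}
n = sum(scores.values())

m_x = sum(x * f for x, f in scores.items()) / n
sd_x = sqrt(sum(f * (x - m_x) ** 2 for x, f in scores.items()) / n)

# A transmutation table: each X score with its T score.
t_table = {x: round(transmute(x, m_x, sd_x), 2) for x in scores}

# Mean and S.D. of the transformed series, computed from unrounded T scores.
t_exact = {x: transmute(x, m_x, sd_x) for x in scores}
m_t = sum(t * scores[x] for x, t in t_exact.items()) / n
sd_t = sqrt(sum(scores[x] * (t - m_t) ** 2 for x, t in t_exact.items()) / n)

print(round(m_x, 3), round(sd_x, 3))
print(t_table)
print(round(m_t, 2), round(sd_t, 2))
```

Since the transformation is linear with a positive slope, the order of the scores and their relative distances from the mean are left unchanged.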


8. THE MEASUREMENT OF SKEWNESS 


Whenever it becomes necessary to compare several distri- 
butions of varying degrees of asymmetry or skewness, some 
numerical measure of this property becomes desirable. Such 
a measure of skewness should be independent of the unit of 
measurement for the variable of the distribution. Thus for a 
distribution of heights, a representation of skewness is needed 
which will remain unchanged whether the measurements be 
made in inches or in centimeters. 

One such measure may be obtained by the formula


\[ Sk = \frac{(Q_3 - Md) - (Md - Q_1)}{Q} = \frac{Q_1 + Q_3 - 2\,Md}{Q}. \]   {Measure of skewness based on quartiles}   (24)

The skewness will thus be positive when the longer tail of
the distribution is in the direction of the high values of the
variable as shown in Fig. 28.


FIG. 28. A positively skewed distribution


The lowest numerical value given by (24) is clearly zero when
the distribution is symmetrical. While a maximum value of 2 may
be obtained with the formula, in actual practice results beyond
the limits ± 1 are rare.

A better measure of skewness is given by Pearson's formula,

\[ Sk = \frac{M - M_o}{\sigma}, \]   {Pearson's measure of skewness}   (25)
which also gives positive values for distributions of the type 
shown in Fig. 28. Owing to the fact that the true mode, Mo, is
very difficult to determine, this formula may be replaced by
another expression in which an approximate value for Mo is
employed. Pearson has shown that for moderately skewed dis-
tributions, the relation between mode, mean, and median is
given by
\[ M_o = M - 3(M - Md). \]
Substituting this value for Mo in equation (25) we find

\[ Sk = \frac{3(M - Md)}{\sigma}. \]   {Approximate measure of skewness}   (26)
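
The two measures of skewness, formulas (24) and (26), may be sketched in Python and applied to the distribution of I.Q.'s of Table 20. The quartiles, median, and mean are those given in the text; the standard deviation of 16.86 is the value implied by the 6 S.D. range quoted earlier and should be treated as an assumption. The function names are hypothetical.

```python
def quartile_skewness(q1, median, q3):
    """Formula (24): skewness from the quartiles and the median."""
    q = (q3 - q1) / 2                      # quartile deviation
    return ((q3 - median) - (median - q1)) / q

def approximate_pearson_skewness(mean, median, sd):
    """Formula (26): 3 (M - Md) / sigma."""
    return 3 * (mean - median) / sd

sk_quartile = quartile_skewness(77.437, 89.31, 100.770)
sk_pearson = approximate_pearson_skewness(89.28, 89.31, 16.86)

# A distribution with quartiles equally spaced about the median gives zero.
sk_symmetric = quartile_skewness(40, 50, 60)

print(round(sk_quartile, 3), round(sk_pearson, 4), sk_symmetric)
```

Both measures are pure numbers, so the results would be the same whatever unit the variable were expressed in.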


As an example we may work out the degree of skewness in the
distribution of I.Q.'s of Table 20, using formulas (24) and (26).
Using (24),

\[ Sk = \frac{(100.77 - 89.31) - (89.31 - 77.437)}{11.67} = -.035. \]

Using (26),

\[ Sk = \frac{3(89.28 - 89.31)}{16.86} = -.0053. \]

For this distribution the skewness, measured by either for-
mula, is negative and slight.


EXERCISES


1. Calculate the mean deviation and the standard deviation for
the following scores (ungrouped): 166, 159, 158, 151, 150, 142, 131,
NAG, Tass, OL. (VM Dy=1T 02 eS ea 1 eee 


2. Compute the standard deviations for the frequency distribu- 
tions of the data of Exercise 3, Chapter II. 
(06 = 19.9 s906=10-52 6; 25-0. Ane) 
3. Calculate the mean deviation, standard deviation, and quartile 
deviation for each of the problems of Exercise 1, Chapter VI. 


1 2 3 4 5 6 ff 8 9 
M.D 28. 2) 62850 [2.79 1 AO) 439) 16.6346 65.25, 1) Essaie as 
eee Sees 8.74 | 3.61 | 1.84 | 15.08 | 21.52 | 5.67 | 7.06 | 1.67 | 1.76 
EN ce gate 5.94 | 2.125 | 1.125 | 7.875) 11.29 | 3.85 | 4.31 | 1.03 | 1.30 


Ans. 


4. Calculate the coefficients of variation for the following dis- 
tributions: 


HiIGH-ScHOOL 
ENGLISH TEACHERS 


HiGH-ScHOOL 


MONTHLY SALARY IN 1914 
ScIENCE TEACHERS 


HSI) 56 6 So oo a 
TEC y ee a aan ob a ap ot 
WWASSWWAD EES 6-5 5 be 6 6 
WAVY) 2 5 Ge a a EO 
NBS ee ees 6 loo Soe 
eS fan oS 5 6 oO Hee 
1 0.5=1'09:99 5 eras at eenist sone ete 
HOA 5 oo gb to oo SSS 


OC) yee ety ar Ry Le ce 


Giles en! [lal 


cy CT ree tie oP a wep ag TS 


SE MO) aie ce ie) eee a se 


Peete ere «ee i) oe em he ty 


Ans. Science: M = $94.83, σ = $15.8, V = 16.7
     English: M = $77.67, σ = $10.9, V = 14.0
The V's are more nearly alike than the σ's. Explain.

5. Verify the results in the following table:


COMPARISON OF FOREIGN-BORN GROUPS FOR DIFFERENT NUMBERS OF 

YEARS IN THE UNITED STATES IN TERMS OF THEORETICAL COMBINED 

SCALE OF INTELLIGENCE (ALPHA, BETA, AND ALL INDIVIDUAL EXAMINA- 
TIONS COMBINED)* 


(Intervals are to be taken as 22-23 with class value 22.5 etc.) 


YEARS IN UNITED STATES 
COMBINED SCALE TOTAL 
0-5 6-10 11-15 16-20 Over 20 
CA reese ret eA 1.0 0.4 1.0 0.4 0.5 3.3 
VAL ee BR Ree 2.8 2.8 2.4 13 3.6 13.2 
0 ScPaiah stone: va, <6 5.8 8.15 6.2 3.83 TA 31.33 
Oye it cs! 14.0 18.5 12.98 8.22 PAI 65.8 
US meer te: Dilee 38.1 27.83 17.94 24.78 136.45 
SIC” ae Geen ee 55.5 Hipast 52.24 32.68 41.11 254.23 
DGiast stat iat. 104.4 142.4 88.51 59.62 66.18 461.11 
HL Dek sa esas 172.5 240.7 139.85 76.99 86.44 716.48 
DS oe eee 265.3 355.2 199.78 115.24 106.35 1,041.87 
UD rst tte ay vs 368.8 490.1 273.95 127.44 iielels 1,387.4 
LD a topesr cy be 441.2 597.0 808.72 119.86 113.13 1,579.91 
‘ltly Geet ee oes 461.5 596.9 247.31 86.62 69.95 1,462.28 
HL OE tee tcheetes 470.9 529.9 189.02 50.89 44,34 1,284.55 
uae, rhe 454.3 474.7 150.88 27.54 28.48 1,135.9 
Oa he ees ihe 342.5 347.4 100.32 17.08 YL 825.07 
HL OSs Ne a a a PAA 207.8 57.38 jaye 9229) 494.69 
CAN siecbie! s 106.8 101.6 26.58 3.92 3.20 242.15 
LSS: Ane ae te ee 44.8 37.2 10.02 1.45 .86 94.33 
Ch ooae Cuaron 16.4 14.5 3.74 -50 25 35.39 
Gy 5 Gees 4.7 4.3 1.03 —_ — 10.03 
YE dee a a ae 5 1.3 232 — — 3.12 
OS oe ae te eee A 33 _— — _— AE 
Motalges to.) 1D-0 4,281.9 1,900.06 758.84 762.89 | 11,279.29 
First quartile . 9.36 9.75 10.66 11.94 12.15 9.98 
Median ~~. . V129 ala 12.55 esol 13.74 12.03 
Third quartile . 13.34 13.61 14.28 15,15 15.59 13.93 
Quartile devia- 
tioneee ae oe 1.99 1.93 1.81 1.61 Ls 1.98 


6. Work out the standard scores for the first five pupils on the three 
intelligence tests of Exercise 1, Chapter II, using the means and 
standard deviations already calculated. 


Otis: 1.59 1.49 — .57 .09 — 1.67 
Chicago: — .17 207. el — .74 — 1.36 
Terman: — .32 1.21 .28 — .83 — 2.28 Ans. 


* Data from Memoirs of the National Academy of Sciences, Vol. XV, p. 704.

7. Convert the following distribution into a series having mean
= 85 and S.D. = √30.


GIVEN DISTRIBUTION           TRANSFORMED DISTRIBUTION

Class Value, X     f            T         f
80                 1          96.62       1
70                 2          92.75       2
60                 3          88.88       3
50                 8          85.00       8
40                 3          81.13       3
30                 2          77.26       2
20                 1          73.39       1

Total             20                     20


Transformation equation is T = .3873 X + 65.64.


8. Derive formula (17) from (16). 


CHAPTER VIII 
THE PERCENTILE METHOD 
1. INTRODUCTORY 


There is nothing essentially new in the method of percentiles, 
but the recent wide use of percentile scores, ranks, and curves 
in dealing with test data warrants a somewhat detailed account 
of this method. 

It is hardly worth while to apply the percentile method in any 
form unless the data are sufficient in number to justify their 
representation in a frequency distribution. Percentiles are ob- 
tained in the same way as the median and quartile values which, 
as we have seen, are not well defined in the case of ungrouped 
items. Furthermore, the irregular nature of short series makes 
the percentile values unstable and of little practical significance. 
For these reasons the method will be discussed only in connec- 
tion with frequency distributions. 


2. PERCENTILES 


A percentile is a value of the variable below which a given
per cent of the frequencies lie. These values may be denoted
by the symbol P_p, where the subscript p refers to the percentage
of observations smaller than P_p. Thus P10, P25, P50, and P82
are values such that 10, 25, 50, and 82 per cent of the cases
lie below them.

From this definition it is apparent that the median is equal
to P50 and that the quartile values Q1 and Q3 are equal respec-
tively to P25 and P75.

Formulas for the computation of percentile values may now
be set up in a form similar to those used for the median.
Let p = the percentage of the cases smaller than P_p,
f_p = the frequency of the class where P_p occurs,
f_up and f_do = the frequency up to and down to the interval
containing the required percentile,
u.l. and l.l. = the upper and lower limits of this interval, and
h and N = the size of the interval and sample as before.
The formulas then become


\[ P_p = \text{l.l.} + \left[\frac{\dfrac{pN}{100} - f_{up}}{f_p}\right] \times h \]   (27a)   {Percentiles}
and
\[ P_p = \text{u.l.} - \left[\frac{\dfrac{(100 - p)N}{100} - f_{do}}{f_p}\right] \times h. \]   (27b)
TABLE 24. ILLUSTRATING THE COMPUTATION OF PERCENTILES

AGE RECEIVED PH.D.      f
45                      3
44                      0
43                      3
42                      3
41                      1
40                      5
39                      9
38                      5
37                      5
36                      7
35                      7
34                     10
33                     13
32                     17
31                     29
30                     42
29                     31
28                     27
27                     37
26                     54      (ages 26 to 45: f_do = 308)
25                     38      = f_p
24                     29
23                     14
22                      7
21                      2
20                      2      (ages 20 to 24: f_up = 54)
Total                 400

pN/100 = (20/100) × 400 = 80

\[ P_{20} = 24.5 + \frac{80 - 54}{38} \times 1 = 24.5 + .684 = 25.184 \]

Check:
\[ P_{20} = 25.5 - \frac{320 - 308}{38} \times 1 = 25.184 \]


In order to illustrate the use of formulas (27a) and (27b), the
complete calculation for P20 is given in Table 24. It should be
noted that the ages are given at class values, the intervals being
44.5-45.5, 43.5-44.5, etc. The check should be used until the
student is confident of the accuracy of his calculations.

By similar computations, the values for P10, P20, up to P90
may be obtained and set down as follows:


P10 = 24.02,    P40 = 26.88,    P70 = 30.43
P20 = 25.18,    P50 = 28.13,    P80 = 31.97
P30 = 26.02,    P60 = 29.47,    P90 = 35.64


These percentiles divide the series into ten equal parts so that 
a given age may be readily located in any part of the distribu- 
tion. Thus if a man received his Ph.D. at twenty-six, it is 
at once apparent that 30 per cent of the men were younger 
than he when they took this degree. Similarly, a man who re- 
ceived the degree at thirty-two was among the oldest fifth of 
the entire group. 
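
The direct method of formulas (27a) and (27b) may be sketched in Python as follows. The frequencies for ages 20 to 45 are those of Table 24, with a few entries recovered from the cumulative frequencies of Table 25, so the list should be treated as an assumption; the function names are hypothetical.

```python
# Frequencies for ages 20, 21, ..., 45 (class limits 19.5-20.5, ..., 44.5-45.5).
freqs = [2, 2, 7, 14, 29, 38, 54, 37, 27, 31, 42, 29, 17,
         13, 10, 7, 7, 5, 5, 9, 5, 1, 3, 3, 0, 3]
LOWEST_LIMIT, HIGHEST_LIMIT, H = 19.5, 45.5, 1

def percentile_from_below(p, freqs, lowest_limit, h):
    """Formula (27a): count p per cent of the cases in from the low end."""
    target = p * sum(freqs) / 100
    f_up = 0
    for i, f in enumerate(freqs):
        if f > 0 and f_up + f >= target:
            return lowest_limit + i * h + (target - f_up) / f * h
        f_up += f
    raise ValueError("p must lie between 0 and 100")

def percentile_from_above(p, freqs, highest_limit, h):
    """Formula (27b): count (100 - p) per cent in from the high end."""
    target = (100 - p) * sum(freqs) / 100
    f_do = 0
    for i, f in enumerate(reversed(freqs)):
        if f > 0 and f_do + f >= target:
            return highest_limit - i * h - (target - f_do) / f * h
        f_do += f
    raise ValueError("p must lie between 0 and 100")

p20 = percentile_from_below(20, freqs, LOWEST_LIMIT, H)
p20_check = percentile_from_above(20, freqs, HIGHEST_LIMIT, H)
deciles = [round(percentile_from_below(p, freqs, LOWEST_LIMIT, H), 2)
           for p in range(10, 100, 10)]

print(round(p20, 3), round(p20_check, 3))
print(deciles)
```

Counting in from either end gives the same value, which is the check recommended in the text.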

The above method for obtaining percentile values is the most
direct and accurate one. The same results may be obtained
more easily, however, by making use of the cumulative fre-
quency curve as described in Chapter II. The computation in
this case is graphical and the accuracy of the results will depend
upon the construction and use of the drawing. When adding in
from the lower end of the series, the cumulative frequency dis-
tribution for ages may be arranged as shown in Table 25.

The plot of these data is shown in the cumulative frequency
curve of Fig. 29. The p scale on the right is made
by dividing the total cumulative frequency scale into 100 equal
parts. In order to obtain any percentile value graphically it is
only necessary to find the required percentile index p on the
p scale, move to the left from this point until the curve is
reached, and then drop down vertically to the percentile value
on the horizontal scale.


TABLE 25. CUMULATIVE FREQUENCY DISTRIBUTION FOR DATA
OF TABLE 24


AGE       FREQUENCY LESS THAN GIVEN AGE
45.5              400
44.5              397
43.5              397
42.5              394
41.5              391
40.5              390
39.5              385
38.5              376
37.5              371
36.5              366
35.5              359
34.5              352
33.5              342
32.5              329
31.5              312
30.5              283
29.5              241
28.5              210
27.5              183
26.5              146
25.5               92
24.5               54
23.5               25
22.5               11
21.5                4
20.5                2


Fig. 29 has been drawn in the form of a polygon, consist- 
ing of straight lines between the cumulative frequency points. 
While it is sometimes legitimate to smooth in the points by a 
free-hand or fitted curve, the student had better confine him- 
self to the use of the polygon until he has made a special study 
of the subject of smoothing. 

Fig. 29. Cumulative frequency curve for ages at which Ph.D.'s
were received

Although greater precision may be obtained by the use of the
direct method of computing percentile values, the equivalence
of the two procedures may be readily seen. Because of the
manner in which the cumulative frequency curve is constructed
the value of the ordinate gives the total frequency below the
corresponding abscissa. Thus 54 frequencies lie below 24.5,
and 92 frequencies lie below 25.5. By joining these points with
a straight line as in Fig. 30, it is assumed that the increase in
cumulative frequencies AB is directly proportional to the part
of the interval up to the corresponding point P on the horizontal
scale. In this case the frequency at P is 20 per cent of the
observations, or 80, so that AB = 80 - 54 = 26, while the whole
interval contains 92 - 54 = 38 frequencies. The value of x is
therefore found from the proportion

\[ \frac{x}{1} = \frac{26}{38}, \qquad x = .68. \]

Adding this result to 24.5, the lower limit of the interval,
gives 25.18, or exactly the same result as by the direct method
of calculation.
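
The interpolation along the cumulative frequency polygon can also be sketched in code. The following Python fragment is only an illustration under stated assumptions: the function name `percentile_from_cumulative` and the list layout are invented for the example, and the figures are the lower rows of Table 25.

```python
# Class limits and the cumulative frequencies below them
# (lower end of Table 25; names are illustrative assumptions).
limits = [20.5, 21.5, 22.5, 23.5, 24.5, 25.5, 26.5]
cum_freq = [2, 4, 11, 25, 54, 92, 146]


def percentile_from_cumulative(p, limits, cum_freq, n):
    """Interpolate linearly along the cumulative frequency polygon."""
    target = p * n / 100.0  # cumulative frequency at the required percentile
    for i in range(1, len(limits)):
        lo, hi = cum_freq[i - 1], cum_freq[i]
        if lo <= target <= hi and hi > lo:
            # (target - lo) is AB; (hi - lo) is the frequency of the interval
            fraction = (target - lo) / (hi - lo)
            return limits[i - 1] + fraction * (limits[i] - limits[i - 1])
    raise ValueError("percentile lies outside the tabulated range")


p20 = percentile_from_cumulative(20, limits, cum_freq, n=400)
print(round(p20, 2))  # 25.18
```

The loop does numerically what the straight-line segment of the polygon does graphically, so the two methods must agree apart from drawing error.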


3. PERCENTILE CURVES 


Fig. 30. Enlargement of a portion of the cumulative frequency
curve to illustrate the calculation of P20

A percentile curve may be made by plotting the series of values
such as those worked out in section 2. The ordinate is the value
of the percentile, while the abscissa is the percentile index.
Such curves should be distinguished from cumulative frequency
curves where integrated frequency is represented by the ordinate
and values of the variable by the abscissa. Both of these are
often called percentile curves, but it is more in harmony with
mathematical convention to name a curve according to what is
represented by the ordinate, and this is the basis for the above
distinction.

Fig. 31. Percentile curve for Ph.D. data (ordinate: age;
abscissa: percentile index p)

Inspection of Figs. 29 and 31 shows that the two curves are
essentially different in form, one being reversed in curvature
from the other. The percentile curve has been termed by Francis
Galton an ogive.

Fig. 32. Percentile or ogive curve for Ph.D. data


Another method for constructing such ogives is by means of
the cumulative frequency distribution.* The difference between
this method and the one just shown is that values of the ends
of the class intervals are now plotted over computed p-scale
values instead of finding the percentiles at given p values, 10,
20, 30, etc., and plotting them.

* This method is used by Otis in connection with the tests
published by the World Book Company.

The calculation is very simple, consisting of a series of
cumulative frequency percentages for the values of p at the ends
of the intervals. In Table 26 these results are set forth for the
Ph.D. data. The computation may be done most readily by
setting the reciprocal of 400 in the calculating machine and
multiplying it into the series of cumulative frequencies. The
curve as shown in Fig. 32 is now constructed by plotting the
values from Table 26 and connecting the resulting points. These
points are indicated by dots, while those previously calculated
from the simple distribution are given for comparison by small
crosses. It is obvious that these latter values could have been
obtained graphically by making use of the ogive formed by
the dots.

TABLE 26. ILLUSTRATING CUMULATIVE FREQUENCY PERCENTAGES

AGE      FREQUENCY LESS THAN     PERCENTAGE
         GIVEN AGE (f)           = (f/N) × 100
45.5     400                     100.0
44.5     397                      99.3
43.5     397                      99.3
42.5     394                      98.5
41.5     391                      97.8
40.5     390                      97.5
39.5     385                      96.3
38.5     376                      94.0
37.5     371                      92.8
36.5     366                      91.5
35.5     359                      89.8
34.5     352                      88.0
33.5     342                      85.5
32.5     329                      82.3
31.5     312                      78.0
30.5     283                      70.8
29.5     241                      60.3
28.5     210                      52.5
27.5     183                      45.8
26.5     146                      36.5
25.5      92                      23.0
24.5      54                      13.5
23.5      25                       6.3
22.5      11                       2.8
21.5       4                       1.0
20.5       2                       0.5
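
The reciprocal-multiplication scheme for the cumulative frequency percentages may be sketched as follows. This is a minimal illustration, not a prescribed procedure; the variable names are assumptions, and only a few rows of Table 26 are carried.

```python
n = 400  # total number of cases in the Ph.D. data

# A few (age limit, cumulative frequency) pairs taken from Table 26.
cumulative = [(24.5, 54), (25.5, 92), (35.5, 359), (36.5, 366), (45.5, 400)]

# "Set" the reciprocal once, then multiply it into every cumulative frequency.
reciprocal = 100.0 / n
percentages = [(age, f * reciprocal) for age, f in cumulative]

for age, pct in percentages:
    print(f"{age:5.1f}  {pct:6.2f}")
```

The unrounded percentages are kept here; Table 26 shows them to one decimal place.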


4. USE OF PERCENTILE CURVES 


A review of the foregoing paragraphs will show that percentile 
values may be calculated in three ways. They may be computed 
numerically from formulas (27a) and (27b), making use of the 
simple frequency distribution; they may be calculated graphically
from the cumulative frequency curve; and finally they
may be obtained graphically from the ogive, or percentile curve. 

The particular method to be used depends upon the adequacy 
of the data, the number of percentiles required, and the accuracy 
needed. Unless the data are fairly plentiful (one hundred or 
more cases), the graphical methods are usually not as expedi- 
tious as the use of the formulas. Furthermore, if only the median 
and quartiles are required, it does not pay to throw the data into 
cumulative or ogive form in order to obtain them. Finally, if 
considerable accuracy be needed in the result, the numerical 
method is far superior to the others. 

In case a number of percentiles are required and the data are 
sufficient in number, either of the above graphical methods may 
be employed to advantage, where only fairly accurate results are 
needed. If the total number of cases gives a convenient quotient
when divided by 100, the p scale of the cumulative curve may
be readily constructed, and this method is probably the better 
to use. For most problems, however, the total frequency is an 
awkward number such as 371, so that the graphical construc- 
tion of the p scale becomes difficult. It is therefore usually best 
in constructing both the cumulative frequency curve and the 
ogive, to use the percentage frequencies as shown in Table 26. 

Another use of percentile curves is in the comparison of two 
series. As an example, one of Otis’s graphs is shown in Fig. 33. 
The two curves shown have been smoothed free-hand, but as 
already indicated the subject of smoothing is beyond a course
such as this, and the student will ordinarily do better to take 
the data at their face value in drawing such percentile curves. 


Fig. 33. Illustrating the method of drawing a percentile curve*

Otis† points out that the scores in Grade 5B are appreciably
higher than those of Grade 4B, but that on the whole the
distributions of scores of the two grades overlap very markedly.
He goes on to show that "a convenient way to express the
overlapping is to state the per cent of scores of Grade 4B that
exceed the median score of Grade 5B, or to state the per cent of
scores in Grade 5B that fall below the median score of Grade 4B.
Thus, by finding the point on the 4B curve having a height
representing a score of 37 (the median score of Grade 5B), we
find that the upper 17 per cent of the scores on Grade 4B, as
indicated by the curve, are above the median score of Grade 5B.
The dotted lines indicate the solution." Otis also shows that
such curves are convenient for finding and comparing various
percentile values. Thus the pupils at the 10 and 90 percentiles
in the two groups differ less widely than do the corresponding
median pupils. This is shown by the vertical distance between
the curves. There is, however, a certain amount of optical
illusion in such comparisons which makes the curves appear to
bulge apart in the middle.

* From Arthur S. Otis, Statistical Method in Educational
Measurement. World Book Company, Yonkers-on-Hudson, New York,
1925.
† A. S. Otis, Statistical Method in Educational Measurement,
p. 87. World Book Company, 1925.


5. PERCENTILE RANKS 


The percentile curve is also useful in determining graphically 
what are known as percentile ranks. These are the p values on 
the horizontal scale for such a curve. The percentile rank of a 
given score is therefore the per cent of the observations below that 
score in the distribution. In obtaining such ranks from the ogive 
it is only necessary to find the given score on the vertical scale, 
run across horizontally until the curve is met, and then drop 
down at right angles to the required p value, or percentile rank. 
As an example, making use of Fig. 32, let it be required to find 
the rank of a man who received his Ph.D. at 36. The result, as
shown, is a percentile rank of about 91. This means that out 
of one hundred such men nine were older when they received 
this degree. 

It may be noted that for percentile ranks 100 is high and 1 is
low, which is contrary to the ordinary practice of assigning 1 to
the highest score in a series. Since the p values should naturally
increase with an increasing value of the variable, this reversal
seems justifiable in the case of percentile ranks, which need not
be confused with ordinary ranks if properly specified.

A formula for percentile ranks may be derived, making use of
numerical rather than graphical interpolation as illustrated
above. This method will be first illustrated by an enlargement
of the portion of the ogive including age 36 as shown in Fig. 34.
From similar triangles it is apparent that

\[ \frac{a}{91.5 - 89.75} = \frac{36 - 35.5}{36.5 - 35.5}, \quad \text{or} \quad a = .875. \]

The required percentile rank is therefore 89.75 + .875 = 90.625.
Such great accuracy as this is, of course, rarely necessary, and
the final result may here be written 90.6, or possibly 91, as
before.

Fig. 34. Illustrating linear interpolation with the ogive

The formula for percentile ranks may now be set up by letting

\(X\) = the value of the given score,
\(R_X\) = its percentile rank,
l.l. = lower limit of the interval containing \(X\),
\(R_u\) and \(R_l\) = the percentile ranks of the upper and lower
limits of this interval, and
\(h\) = width of the class interval.

We therefore have

\[ R_X = R_l + \frac{R_u - R_l}{h}\,(X - \mathrm{l.l.}) \tag{28} \]
(Percentile rank formula, form 1)

As an illustration of the use of this formula we may compute
once more the percentile rank for the age 36. From Table 26,

\[ R_u = \frac{100 \times 366}{400} = 91.5, \quad \text{and} \quad R_l = \frac{100 \times 359}{400} = 89.75. \]

We therefore obtain

\[ R_{36} = 89.75 + \frac{91.5 - 89.75}{1}\,(36 - 35.5) = 90.625, \text{ or approximately } 91. \]
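
Formula (28) is short enough to check by machine. The sketch below is illustrative only; the function and argument names are assumptions, while the figures are those of the worked example for age 36.

```python
def percentile_rank_form1(x, lower_limit, rank_lower, rank_upper, width):
    """Formula (28): interpolate between the ranks of the class limits."""
    return rank_lower + (rank_upper - rank_lower) / width * (x - lower_limit)


n = 400
r_lower = 100 * 359 / n  # rank of the lower limit 35.5, i.e. 89.75
r_upper = 100 * 366 / n  # rank of the upper limit 36.5, i.e. 91.5

r36 = percentile_rank_form1(36, 35.5, r_lower, r_upper, width=1)
print(r36)  # 90.625
```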
A more direct and usually more convenient formula for
percentile ranks may be obtained by letting

\(f_x\) = frequency of the interval where \(X\) occurs, and
\(f_{up}\) = frequency up to this interval.

We may then write

\[ R_X = \frac{100\,[\,f_x (X - \mathrm{l.l.}) + (f_{up})\,h\,]}{N h} \tag{29} \]
(Percentile rank formula, form 2)

Thus from Table 24 the frequency at age 36 is \(f_x = 7\), while
the frequency up to this interval is 359. Substituting these
values in formula (29), we find that

\[ R_{36} = \frac{100\,[\,7(36 - 35.5) + 359 \times 1\,]}{400 \times 1} = 90.625, \]

as before. The student should show that formulas (28) and
(29) are equivalent.

Instead of finding the percentile rank of X it is often sufficient
to find the percentile rank of the class value of the interval
where X occurs. In this case, formula (29) reduces easily to

\[ R_X = \frac{50 f_x}{N} + \frac{100 (f_{up})}{N} = R_l + \frac{50 f_x}{N} \tag{30} \]
(Class value percentile rank)

Applying this form to age 36, we again find \(R_{36} = 90.625\), since
36 happens to be the class value of the interval. The rank 91
would be used for any age between 35.5 and 36.5, according to
this last approximation, and this is often sufficiently accurate
with a narrow class interval.
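
Formulas (29) and (30) may be compared side by side in a short sketch. The function names below are assumptions made for illustration; the inputs are the Table 24 figures quoted above for age 36.

```python
def percentile_rank_form2(x, lower_limit, f_x, f_up, n, width):
    """Formula (29): rank from the interval frequency and the frequency below."""
    return 100 * (f_x * (x - lower_limit) + f_up * width) / (n * width)


def class_value_rank(f_x, f_up, n):
    """Formula (30): rank of the class value (midpoint) of the interval."""
    return 50 * f_x / n + 100 * f_up / n


rank_29 = percentile_rank_form2(36, 35.5, f_x=7, f_up=359, n=400, width=1)
rank_30 = class_value_rank(f_x=7, f_up=359, n=400)
print(rank_29, rank_30)  # 90.625 90.625
```

The two results coincide here only because 36 is itself the class value of the interval 35.5 to 36.5; for any other age in that interval formula (30) still returns 90.625, while formula (29) does not.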

The percentile ranks of a set of scores often make a very con- 
venient record for administrative use. This may be illustrated 
in the case of a group of graduate students who were given an 
intelligence test. The gross scores were not used because of the 
lack of suitable norms for such groups. By converting the 
scores of the tests into percentile ranks, the relative position of 
each student in the group could be seen at a glance. Thus John 
Doe’s graduate record might appear as follows: 
General average of college marks . . . . . . . . . A-
Estimated fitness for research . . . . . . . . . . Excellent
Personality  . . . . . . . . . . . . . . . . . . . Pleasing
Experience in college teaching . . . . . . . . . . 3 years
Age  . . . . . . . . . . . . . . . . . . . . . . . 25
Percentile rank in intelligence test . . . . . . . 97


The youth, high scholarship, and personality of this man are 
indications of future success in college teaching. His percentile 
rank of 97 is additional evidence in this respect, because in 
a group of 119 students, he was exceeded by only 3 per cent in
general mental alertness. 

Inasmuch as a very accurate rank for such purposes is not 
required, the graphical method of determination from the per- 
centile curve may be conveniently used. An error of 1 per cent 
in the percentile rank of a student will make no difference in 
the administrative interpretation of the test result, and this 
degree of accuracy may be easily obtained from the ordinary 
free-hand graph. 

While percentile ranks will furnish the medians and quartiles 
of the original distribution of scores, the ranks should not be 
treated like actual scores. In combining scores from several 
tests, for example, it would not be legitimate to add the raw 
scores from certain tests to other scores expressed in the form
of percentile ranks. The distribution of such ranks will approx- 
imate a long rectangle, the standard deviation of which is of 
doubtful significance. It is therefore much better to keep the 
data in their original form for most purposes and to convert the 
items into percentile ranks only for such uses as those which 
have been described above. 

In general the whole percentile method is cruder but some- 
times more convenient than methods in which the raw scores 
are employed directly. Percentile curves and ranks are in ex- 
tensive use at present, but for careful analytical work it is usu- 
ally best to employ methods based on the actual rather than 
on the relative values of the scores. 
EXERCISES


1. Calculate the nine deciles by formula (27) for the distribution
of science teachers' salaries given in Exercise 4 of Chapter VII.

(P10 = $76.13; P20 = $80.93; P30 = $85.50; P40 = $88.77;
P50 = $92.81; P60 = $99.63; P70 = $102.65; P80 = $106.57;
P90 = $114.80. Ans.)


2. Construct a cumulative frequency curve for the data of
Exercise 1, and check the above deciles by graphical computation.


3. Work out the nine deciles for the distribution of the
fourth-year high-school group from the table in Exercise 3,
Chapter VI.

(P10 = 66.74; P20 = [illegible]; P30 = 92.76; P40 = [illegible];
P50 = 110.91; P60 = 119.01; P70 = 128.84; P80 = 138.98;
P90 = 152.12. Ans.)

4. Construct a percentile curve from the values obtained in
Exercise 3.

5. Calculate a table of class-value percentile ranks, using
formula (30) and the data of Exercise 3. Check by the cumulative
frequency curve.


6. Compute by formula (29) the percentile ranks for the fol- 
lowing scores: 167, 35, 171, 81, and 104, using the distribution of 
Exercise 3. (96.4, 1.8, 97.5, 19.7, 42.0. Ans.) 


7. Prove that formulas (28) and (29) are equivalent. 


CHAPTER IX
LINEAR CORRELATION WITH QUANTITATIVE SERIES
1. THE MEANING OF CORRELATION


Correlation is sometimes defined as the concomitant variation
of two traits. This definition may be illustrated by the scores
of fifty pupils on the Otis and Chicago group intelligence tests
listed in Exercise 1, Chapter II. In running through the pairs of
scores for each pupil, it will be noted that a high score on one
test is usually associated with a high score on the other, while a
low score on one test tends to be paired with a correspondingly
low score on the second test. This relationship, or correlation,
is brought out more clearly by means of a scatter diagram, which
is merely a plot of the associated pairs of scores as shown in
Fig. 35.

Fig. 35. Scatter diagram of the Otis and Chicago test scores.
(The numbers identify the scores given in Exercise 1, Chapter II)

There is a general tendency in this diagram for the points to
form a straight band across the graph, and this furnishes
graphical evidence of linear correlation. The narrower the band
and the more closely the points cluster along a straight line, the
higher such correlation becomes.
In Fig. 36 another scatter diagram is shown, but the band
in this example forms a distinct curve. The correlation in
this case is regarded as non-linear, but like linear correlation
the relationship between the two variables becomes closer as
the points form a narrower and narrower band, finally
approximating a single-valued mathematical function.

Fig. 36. Showing the relationship between per-pupil expenditure
and percentage of total school expenditure derived from the
state. (Data supplied by Dr. R. E. Wager)


Perfect correlation is reached when all of the points in the
scatter diagram fall exactly on a curve.* Two examples of such
relationship are shown in Fig. 37, one for linear and one for
non-linear correlation. With observed data, perfect correlation
is, of course, impossible, but very close approximations are often
reached in verifying physical laws such as stress = k × strain.

* It will be remembered that curve is a general expression for
the designation of both linear (straight line) and non-linear
functions.

In view of the above discussion we may now define correlation
as the tendency for two observed variables to be related in the form
of a single-valued mathematical function, or, more briefly, as the
tendency toward single-valued functionality. A single-valued
function is one such that for any value of the argument only
one value of the function results.

The present chapter will deal with linear correlation for
quantitative series, while Chapter X will be devoted to the
measurement of curvilinear relationship. For both types of
correlation it is possible to express the degree of association in
numerical terms, and obtain an equation for the mathematical
curve which most closely approximates the data.

Fig. 37. Illustrating two types of perfect correlation (panels:
perfect linear correlation; perfect non-linear correlation)


2. THE PRODUCT-MOMENT CORRELATION COEFFICIENT


In Fig. 38 on page 144 it will be noted that the plane has 
been divided into four quadrants by erecting perpendiculars at 
the means on the two scales. Designating these quadrants in the 
usual way, it appears that points located in the first and third 
quadrants will tend to produce high correlation, while points 
located in the other two quadrants will tend to reduce the 
amount of such correlation. When the points are scattered 
randomly over the plane, the correlation will approach zero. 
A measure of relationship might then be obtained by considering
the products of the deviations from the means, that is,
\(x = X - M_x\), and \(y = Y - M_y\). For a point such as \(P_1\), the
product xy will be positive, while for \(P_2\) in the second
quadrant, the xy product will be negative, etc. The average
of all such pair-products might be used to measure correlation,
were it not for the fact that the deviations are expressed in the
units of the respective scales. In order to overcome this
difficulty it is only necessary to use standard scores, that is,
\(x/\sigma_x\) and \(y/\sigma_y\), and the resulting product average will
then become a pure number. Thus for N pairs of associated
points the product-moment correlation coefficient becomes

\[ \frac{1}{N}\left( \frac{x_1 y_1}{\sigma_x \sigma_y} + \frac{x_2 y_2}{\sigma_x \sigma_y} + \frac{x_3 y_3}{\sigma_x \sigma_y} + \cdots + \frac{x_N y_N}{\sigma_x \sigma_y} \right). \]

Fig. 38. Illustrating product moments in four quadrants

Representing this coefficient by the symbol r, and denoting the
sum in the usual way, we have

\[ r = \frac{\sum xy}{N \sigma_x \sigma_y} \tag{31} \]
(Product-moment correlation coefficient, original form)

or

\[ r = \frac{\sum xy}{\sqrt{\sum x^2 \cdot \sum y^2}} \tag{32} \]

since \(\sigma_x = \sqrt{\sum x^2 / N}\) and \(\sigma_y = \sqrt{\sum y^2 / N}\).


The product-moment coefficient is thus the arithmetical mean
of the pair-products of associated standard scores. The above
formula was developed originally by Karl Pearson,* who based
his proof upon the product-moment function of Bravais, and a
method of Galton's closely related to the above standard scores.
It is therefore often called the Pearson correlation coefficient.

If all the points in the scatter diagram lie on a straight
line, the equation of this line through the origin (Fig. 39) will
be \(y = \pm mx\), where \(\pm m\) is the slope of the line. The value
for the correlation coefficient in such a case may now be
determined by noting that

\[ \sigma_y = \sqrt{\frac{\sum y^2}{N}} = \sqrt{\frac{m^2 \sum x^2}{N}} = m\sigma_x, \]

and that

\[ \sum xy = \pm m \sum x^2 = \pm m N \sigma_x^2. \]

We may then write

\[ r = \frac{\sum xy}{N \sigma_x \sigma_y} = \frac{\pm m N \sigma_x^2}{m N \sigma_x^2} = \pm 1. \]

In case all the points lie on a horizontal line with a zero slope,
the value for r becomes indeterminate, that is, 0/0. A
symmetrical arrangement of the points about such a line, however,
will give zero correlation, as shown in Fig. 39, because the
quantity \(\sum xy\) is zero while N, \(\sigma_x\), and \(\sigma_y\) are not zero.
With actual data, therefore, the correlation coefficient may
range in value from -1 to 1.

Fig. 39. Illustrating extreme variations in the correlation
coefficient (panels: r = +1, r = 0, r = -1)

* A full discussion of the history of correlation is given by
Pearson in Biometrika, Vol. XIII (1920). Here Pearson assigns
most of the credit to Galton and minimizes the significance of
his own important contribution.
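
The limiting values of the coefficient can be checked numerically. The sketch below is an illustration under stated assumptions: the function name `pearson_r` and the three small series are invented for the purpose and do not come from the text. It applies formula (32) to points on a rising line, points on a falling line, and a symmetrical arrangement.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Formula (32): sum(xy) / sqrt(sum(x^2) * sum(y^2)), x and y being deviations."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    dx = [v - mean_x for v in xs]
    dy = [v - mean_y for v in ys]
    sum_xy = sum(a * b for a, b in zip(dx, dy))
    return sum_xy / sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))


xs = [1, 2, 3, 4, 5]                                 # hypothetical scores
r_rising = pearson_r(xs, [2 * v for v in xs])        # all points on y = 2x
r_falling = pearson_r(xs, [-2 * v for v in xs])      # all points on y = -2x
r_symmetric = pearson_r(xs, [3, 1, 0, 1, 3])         # symmetrical about the middle
print(r_rising, r_falling, r_symmetric)
```

The first two results should be +1 and -1, and the third should be zero apart from rounding in the floating-point arithmetic.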
3. COMPUTATION OF THE CORRELATION COEFFICIENT
WITH UNGROUPED ITEMS


While formulas (31) and (32) are useful for calculating the
correlation coefficient, the arithmetic may become rather tedious
on account of the fact that the deviations x and y are usually
given in the form of decimals which would need to be multiplied
and squared. In order to overcome this difficulty an alternative
formula will be given.

Remembering that \(x = X - M_x\), and that \(y = Y - M_y\),
we may write

\[ xy = XY - Y M_x - X M_y + M_x M_y, \]

and

\[ \sum xy = \sum XY - M_x \sum Y - M_y \sum X + M_x M_y \sum (1) = \sum XY - N M_x M_y, \]

since \(\sum X = N M_x\), \(\sum Y = N M_y\), and \(\sum (1) = N\).

From the chapter on dispersion it is also evident that

\[ \sigma_x = \sqrt{\frac{\sum X^2}{N} - M_x^2} \]

and that

\[ \sigma_y = \sqrt{\frac{\sum Y^2}{N} - M_y^2}. \]

Substituting these values in equation (31) gives the desired
formula,

\[ r = \frac{\sum XY - N M_x M_y}{\sqrt{(\sum X^2 - N M_x^2)(\sum Y^2 - N M_y^2)}} \tag{33} \]
(Correlation coefficient, based on raw scores)


This expression, although more complicated in form than
formula (32), is generally preferable to the latter because it
involves the integral scores X and Y rather than the deviations
x and y as decimals. By designating the total as T, it is
evident that T = NM, and the above formula may also be written

\[ r = \frac{\sum XY - T_x M_y}{\sqrt{(\sum X^2 - T_x M_x)(\sum Y^2 - T_y M_y)}} \tag{34} \]
(Correlation coefficient, equivalent to (33))

It should also be noted that when the variables are measured
from arbitrary origins, \(A_x\) and \(A_y\), the last two formulas may be
applied to \(X' = X - A_x\), and to \(Y' = Y - A_y\). This follows at
once from the fact that

\[ M'_x = M_x - A_x, \quad \text{and} \quad M'_y = M_y - A_y, \]

so that \(x' = x\) and \(y' = y\).
The primed variables may then replace the unprimed variables
throughout in formulas (33) and (34), which means that the
correlation remains unchanged when any numbers A and B are
subtracted from X and Y, respectively.

The above formulas will now be applied to a short problem 
with the data in the form of listed pairs of associated scores on 
two tests, X and Y. The material is too scanty to have any 
practical value, but has been chosen because the arithmetic is 
short and the attention may be fixed on the form of computation. 


TABLE 27. ILLUSTRATING THE COMPUTATION OF THE PRODUCT-MOMENT
CORRELATION COEFFICIENT BY DEVIATIONS FROM THE MEANS

PUPIL     X      Y       x       y       x²       y²       xy
A        76     17     -1.9    -0.2     3.61     0.04    + 0.38
B        74     15     -3.9    -2.2    15.21     4.84    + 8.58
C        82     14     +4.1    -3.2    16.81    10.24    -13.12
D        63     12    -14.9    -5.2   222.01    27.04    +77.48
E        74     18     -3.9    +0.8    15.21     0.64    - 3.12
F        91     19    +13.1    +1.8   171.61     3.24    +23.58
G        86     20     +8.1    +2.8    65.61     7.84    +22.68
H        82     23     +4.1    +5.8    16.81    33.64    +23.78
I        79     20     +1.1    +2.8     1.21     7.84    + 3.08
J        72     14     -5.9    -3.2    34.81    10.24    +18.88

        77.9   17.2                   562.90   105.60   162.20
       (Mx)   (My)                    (Σx²)    (Σy²)    (Σxy)


In applying formula (32) it is first necessary to obtain the
means and deviations from the means for the two series, the
latter being given in the columns x and y above. The squared
and product terms are then formed and the sums Σx², Σy², and
Σxy obtained. The value for the coefficient then becomes

\[ r = \frac{162.2}{\sqrt{562.9 \times 105.6}} = \frac{162.2}{\sqrt{59442.24}} = \frac{162.2}{243.8} = +.665. \]
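
The arithmetic of Table 27 can be reproduced with a few lines of code. This is a minimal sketch of formula (32) applied to the ten pairs of scores; the variable names are assumptions chosen for the illustration.

```python
from math import sqrt

# The ten pairs of scores on tests X and Y (Table 27).
X = [76, 74, 82, 63, 74, 91, 86, 82, 79, 72]
Y = [17, 15, 14, 12, 18, 19, 20, 23, 20, 14]
n = len(X)

mean_x, mean_y = sum(X) / n, sum(Y) / n       # 77.9 and 17.2
x = [v - mean_x for v in X]                   # deviations from the mean of X
y = [v - mean_y for v in Y]                   # deviations from the mean of Y

sum_x2 = sum(v * v for v in x)                # about 562.9
sum_y2 = sum(v * v for v in y)                # about 105.6
sum_xy = sum(a * b for a, b in zip(x, y))     # about 162.2

r = sum_xy / sqrt(sum_x2 * sum_y2)            # formula (32)
print(round(r, 3))  # 0.665
```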
When using formula (33) or (34), it will be found convenient
to reduce the scores before calculating the necessary quantities.
Subtracting 70 from each of the X scores and 15 from each of
the Y scores gives the X' and the Y' series shown in the
following illustration (data from Table 27).


TABLE 28. ILLUSTRATING THE COMPUTATION OF THE PRODUCT-MOMENT
CORRELATION COEFFICIENT BY DEVIATIONS FROM ASSUMED MEANS

PUPIL          X'     Y'     (X')²    (Y')²    X'Y'
A               6      2       36       4       12
B               4      0       16       0        0
C              12     -1      144       1      -12
D              -7     -3       49       9       21
E               4      3       16       9       12
F              21      4      441      16       84
G              16      5      256      25       80
H              12      8      144      64       96
I               9      5       81      25       45
J               2     -1        4       1       -2
Total          79     22     1187     154      336
Mean          7.9    2.2
Correction                   624.1    48.4    173.8
                           (T'xM'x) (T'yM'y) (T'xM'y)
Difference                   562.9   105.6    162.2

T'xM'y = T'yM'x = 173.8.


The squaring and multiplying may now all be done mentally
and the computation arranged as in the foregoing scheme. Using
formula (34) we then have

\[ r = \frac{162.2}{\sqrt{562.9 \times 105.6}} = +.665, \]

as before.
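
The scheme of Table 28 may be sketched in code as well. The fragment below is illustrative only, with assumed variable names; it reduces the scores by 70 and 15, forms the totals and means, and applies formula (34).

```python
from math import sqrt

X = [76, 74, 82, 63, 74, 91, 86, 82, 79, 72]
Y = [17, 15, 14, 12, 18, 19, 20, 23, 20, 14]

Xr = [v - 70 for v in X]          # reduced scores X'
Yr = [v - 15 for v in Y]          # reduced scores Y'
n = len(Xr)

t_x, t_y = sum(Xr), sum(Yr)       # totals, 79 and 22
m_x, m_y = t_x / n, t_y / n       # means, 7.9 and 2.2

numerator = sum(a * b for a, b in zip(Xr, Yr)) - t_x * m_y   # 336 - 173.8
term_x = sum(a * a for a in Xr) - t_x * m_x                  # 1187 - 624.1
term_y = sum(b * b for b in Yr) - t_y * m_y                  # 154 - 48.4

r = numerator / sqrt(term_x * term_y)                        # formula (34)
print(round(r, 3))  # 0.665
```

Because only integers are squared and multiplied before the three corrections are applied, this arrangement mirrors the mental arithmetic recommended in the text.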


Of the two types of calculation shown above, the second is 
usually the easier, although both become rather tedious with 
a long series. Formula (33) and occasionally formula (32) are
recommended for short series of, say, 20 to 30 pairs of scores, 
which do not warrant the use of a frequency table. 

As a warning to the student, it may be noted that correla- 
tions based upon such a small number of cases are not of much 
significance. In experimental work, however, problems of this 
sort do arise, and it would then be convenient to employ the 
above methods. 
4. THE COMPUTATION OF THE CORRELATION COEFFICIENT
FOR A FREQUENCY TABLE


A two-way frequency table, or correlation table, may be made
by noting the frequencies which occur in the cells bounded by
certain class limits on the two scales. Thus, by taking class
intervals of 79.5-89.5, 89.5-99.5, 99.5-109.5, etc. for the Otis
test, and 29.75-34.75, 34.75-39.75, etc. for the Chicago test, a
scheme for tabulation may be set up as shown in Table 29 (see
data from Exercise 1, Chapter II). Instead of recording a pair
of scores as a point on a scatter diagram, it is only necessary
to make a tally in the cell within which this pair of scores must
lie. All frequencies in a given cell are then assumed to have
the class values of that cell. For example, the scores of the first
pupil are 171 and 52. This pair of scores is recorded by a tally
in the cell indicated by 169.5-179.5 on Otis and 49.75-54.75 on
Chicago, with the resulting class values 174.5 and 52.25. It will
be noted that similar errors (differences between class values
and observed values) appear throughout the table, but that
the combined effect of these upon the correlation coefficient
will not be large if the class intervals are fairly numerous on
both scales, for example, from 10 to 20 intervals as in the case
of the simple frequency distribution.

TABLE 29. CORRELATION TABLE FOR THE OTIS AND CHICAGO TEST SCORES *

* The exact class limits have not been set down in the table, but
these should be kept in mind in tabulating the frequencies and in
subsequent calculations for the means.
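
The tallying of a pair of scores into its cell may be sketched as follows. The function name `cell` is an assumption, and only the first pupil's scores quoted in the text are used; the remaining pairs would come from Exercise 1, Chapter II, which is not reproduced here.

```python
from collections import Counter
from math import floor


def cell(otis, chicago):
    """Return the class values (midpoints) of the cell holding one pair of scores."""
    # Otis intervals:    79.5-89.5, 89.5-99.5, ...   (width 10)
    # Chicago intervals: 29.75-34.75, 34.75-39.75, ... (width 5)
    i = floor((otis - 79.5) / 10)
    j = floor((chicago - 29.75) / 5)
    return (79.5 + 10 * i + 5, 29.75 + 5 * j + 2.5)


pairs = [(171, 52)]  # scores of the first pupil, as quoted in the text
table = Counter(cell(o, c) for o, c in pairs)
print(table)  # Counter({(174.5, 52.25): 1})
```

Each key of the counter is a pair of class values and each count is a cell frequency, which is all that the later calculation requires.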

Unless a mechanical sorting device is available the best way 
to make a correlation table is with the aid of the small data 
tickets described in Chapter III. These may be sorted into a
simple frequency distribution according to one of the variables, 
and each of the sub-piles then sorted for a distribution of the 
associated variable. It will be found convenient to write down 
the class limits on small slips of paper, laying these out in a row 
on a long table. The cards are then sorted into piles and the 
work verified by running through each one. These piles may 
then be secured with rubber bands and a new series of class 
intervals prepared for the next variable. As soon as each pile 
has been sorted in this way, the results may be tabulated in 
the appropriate column on a sheet of square-ruled paper or on 
a special form to be described below. 

In calculating the correlation coefficient for a frequency table 
it will be necessary to modify formula (88) so as to bring in 
the frequency notation. Let 

f, = the frequency of a column of type 2, 
f, = the frequency of a row of type y, 
fry = the frequency of a cell common to such a column 
and row, 
d, and d, = the deviations in class intervals from the assumed 
means on the two scales, 
h and k = the widths of the class intervals for the variables 
X and Y, respectively, and 
X’ and Y’ = the variables measured from arbitrary origins, A; 
and A,. 
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It is now evident that d, = x and d, = = 
so that LX'Y' = 2d,d,hk = (Zfey - dody)hk, 


Jz, being merely a symbol of operation. In a similar way it 
appears that 


NM'M', = GLE CMM gy, 3x = (Bf.d.2\n?, 
and N(M’,)? = ia 
Substituting these values in formula (33) gives 
Srydedly — See 
"7 | Bde a oer | Bi? es Che TUE 


{Correlation coefficient for distribution table} 


(35) 


from which it follows that the correlation coefficient is quite 
independent of the magnitudes of the class intervals and of 
the units of measurement. The three principal terms in this 
expression have been designated as a, b, and c for convenience. 

The complete calculation with formula (85) is illustrated in 
Table 30 on page 152 for the Otis-Chicago data. 


$$a = 170 - 7.2 = 162.8$$
$$b = 210 - 11.52 = 198.48$$
$$c = 225 - 4.5 = 220.5$$
$$r = \frac{a}{\sqrt{bc}} = \frac{162.8}{\sqrt{198.48 \times 220.5}} = \frac{162.8}{\sqrt{43764.84}} = \frac{162.8}{209.2} = .778$$

By four-place logarithms,

log b       = 2.2978          log a       = 12.2116 - 10
log c       = 2.3434          log √prod.  =  2.3206
log prod.   = 4.6412          log r       =  9.8910 - 10
log √prod.  = 2.3206          r           = + .778
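The arithmetic above may also be checked by machine. The following is only a sketch in Python, using the three principal terms exactly as printed; the variable names are chosen here for illustration and are not part of the original scheme.

```python
import math

# Principal terms as printed in the worked example (Otis-Chicago data).
a = 170 - 7.2      # sum of f_xy * d_x * d_y, less its correction
b = 210 - 11.52    # sum of f_x * d_x^2, less its correction
c = 225 - 4.5      # sum of f_y * d_y^2, less its correction

# Straight arithmetic: r = a / sqrt(b * c).
r_direct = a / math.sqrt(b * c)

# The same quotient by logarithms, mirroring the four-place table work:
# log r = log a - (log b + log c) / 2.
log_r = math.log10(a) - 0.5 * (math.log10(b) + math.log10(c))
r_by_logs = 10 ** log_r

print(round(r_direct, 3), round(r_by_logs, 3))
```

Both routes give the same coefficient to three places, which is the agreement the logarithmic layout is meant to show.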
The computation down to the quantities $\sum f_x d_x^2$ and $\sum f_y d_y^2$
is the same as for the standard deviation, so that the values
for b and c may be readily obtained.

The calculation for a presents a little more difficulty. The
quantity $\sum f_{xy} d_x d_y$ is the result of multiplying each cell frequency
by its $d_x$ and $d_y$ and then adding all the products so formed.
A more convenient method of calculation, however, is to multiply
the cell frequencies in a particular column by the appropriate
$d_y$ values, add the results thus found, and multiply this
sum by the $d_x$ value of the column. Continuing in this way
for all the columns, and adding the products thus found gives
the required $\sum f_{xy} d_x d_y$. Thus, the frequencies in the column at
150-160 on Otis have been multiplied by the corresponding
$d_y$ values and the results recorded in the lower left corners of
the cells as 5, 6, 4, 4, 0 (coming down from the top). The sum of
these quantities is 19, which, when multiplied by the $d_x$ value
2, gives 38 as the contribution of this column to the total
$\sum f_{xy} d_x d_y$. The same result would have been obtained if the cell
frequencies had been multiplied by $d_x$ and $d_y$ at the same time
and added, thus:

$$1 \times 2 \times 5 + 2 \times 2 \times 3 + 2 \times 2 \times 2 + 4 \times 2 \times 1 + 2 \times 2 \times 0$$
$$= 10 + 12 + 8 + 8 + 0 = 38.$$

The work is shortened, however, by factoring out $d_x$, which is
common to all the products.

The symbol $\sum$ has been used to indicate summation over the
whole table, that is, over N items. In order to distinguish summation
over the arrays (columns or rows), this has been designated
in the table by $\sum'$. Thus, $\sum' f_{xy} d_y$ means the sum for one
column of $f_{xy}$ multiplied by the corresponding values $d_y$.

A very useful check on the computation of a is shown by the
double arrow in Table 30. The sum of the quantities $\sum' f_{xy} d_y$
should be the same as $\sum f_y d_y$, or $\sum(\sum' f_{xy} d_y) = \sum f_y d_y$. Until he
becomes proficient in working with the numbers in the corners
of the cells, the student should always use this check.
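The column-by-column shortcut and the check just described may be sketched as follows. The small three-by-three table is hypothetical (it is not the Otis-Chicago table), and the function and variable names are illustrative only; the one figure taken from the text is the 150-160 column, whose corner entries 5, 6, 4, 4, 0 and $d_x$ value 2 are quoted above.

```python
# Hypothetical correlation table: freq[i][j] is the cell frequency in
# row i (deviation d_y[i]) and column j (deviation d_x[j]).
d_x = [-1, 0, 1]
d_y = [1, 0, -1]
freq = [
    [0, 2, 3],
    [1, 4, 1],
    [3, 2, 0],
]

def column_sums(freq, d_y):
    """Return the sum of f_xy * d_y for each column (the corner entries added)."""
    n_rows, n_cols = len(freq), len(freq[0])
    return [sum(freq[i][j] * d_y[i] for i in range(n_rows)) for j in range(n_cols)]

col = column_sums(freq, d_y)

# Shortcut: multiply each column sum by that column's d_x, then add.
sum_fxy_dxdy = sum(s * dx for s, dx in zip(col, d_x))

# Cell-by-cell version, for comparison with the shortcut.
cellwise = sum(freq[i][j] * d_x[j] * d_y[i]
               for i in range(len(freq)) for j in range(len(freq[0])))

# Check: the column sums must add up to the sum of f_y * d_y from the row totals.
row_totals = [sum(row) for row in freq]
sum_fy_dy = sum(f * d for f, d in zip(row_totals, d_y))

# The 150-160 column of the text: corner entries and d_x = 2.
otis_products = [5, 6, 4, 4, 0]
otis_contribution = sum(otis_products) * 2

print(sum_fxy_dxdy, cellwise, sum(col), sum_fy_dy, otis_contribution)
```

If the sum of the column entries fails to equal the sum from the row totals, an error has been made in the corner entries before any multiplication by $d_x$.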
The correction $(\sum f_x d_x)(\sum f_y d_y)/N$ applied to $\sum f_{xy} d_x d_y$ will sometimes
be positive and sometimes negative, and it should be
remembered that it is to be subtracted algebraically. The remainder
of the calculation for r is shown in the example both by
straight arithmetic and by logarithms, which are recommended.

Computations of this length need to be carefully planned and 
arranged in order that they may be done quickly and accurately. 
A standard form has therefore been prepared by the writer. The 
data may be recorded directly on this sheet (Table 31), and the 
calculations performed very rapidly. 


5. LINES OF REGRESSION 


In the problem just worked out, it was assumed that the 
trend of the data in the correlation table was such that the 
linear method might be applied. As already pointed out, 
the maximum correlation will occur when all the points fall 
on a straight line; but with any scatter in the data, two lines will 
be obtained for a correlation table, one from the means of the 
columns and one from the means of the rows. The graphical 
test for approximate linearity and the justification of the use of 
the product-moment method are furnished by plotting the 
means of these arrays and noting the extent to which they fall 
on these two straight lines. A more rigorous test will be given in 
Chapter X, where it is shown that lack of such linearity reduces 
the amount of the product-moment correlation. 

The curves fitting the means of the columns and rows are 
known as regression curves. They will be illustrated with a larger 
body of data than that used above, on account of the small num- 
ber of frequencies in the Otis-Chicago table. The material in 
Table 32 was supplied through the courtesy of Mr. Douglas E. 
Scates.* The values on the horizontal scale are the percentage
grades of students based on four years’ work in high school,
while the vertical scale gives the average mark of these students
in grade points after three quarters of residence in The University
of Chicago. If the means of the columns be denoted
by $Y_x$ and the means of the rows by $X_y$, these values may be
computed from the table and arranged as follows:

* “A Study of High-School and First-Year University Grades,” The School
Review, Vol. XXXII (March, 1924).


TABLE 33. MEANS OF THE COLUMNS AND ROWS OBTAINED FROM TABLE 32


   X       Y_x          Y       X_y

  80.5     1.75       -1.00     83.0
  81.5     1.54       -0.67     84.1
  82.5     1.66       -0.33     83.9
  83.5     1.87        0.00     83.2
  84.5     1.99        0.33     83.7
  85.5     2.23        0.67     83.9
  86.5     2.22        1.00     84.1
  87.5     2.70        1.33     84.3
  88.5     2.86        1.67     84.3
  89.5     3.08        2.00     85.5
  90.5     3.38        2.33     85.6
  91.5     3.40        2.67     86.7
  92.5     3.58        3.00     87.7
  93.5     3.95        3.33     87.9
  94.5     4.20        3.67     88.1
  95.5     4.37        4.00     89.7
  96.5     4.81        4.33     90.6
  97.5     5.14        4.67     92.6
                       5.00     92.2
                       5.33     92.6
                       5.67     94.0
                       6.00     95.5


In Fig. 40 the means of the columns have been plotted as 
dots and the means of the rows as crosses. The former array 
of points fits rather closely a straight line drawn through them, 
but the means of the rows form an irregular curve. While free- 
hand curves drawn through both sets of points would give rough 
approximations to the regression curves, it is better to employ 
a mathematical method for fitting such curves. In the following 
paragraph we shall, therefore, discuss a method for obtaining 
the best-fitting regression curve in the form of a straight line. 
FIG. 40. Regression lines for 1707 grades. The dots represent the means of the columns and the crosses the means of the rows.


If we select the line for the means of the columns it is evident
that, when taken through the mean of the table at M (Fig. 41),
its equation will be of the form

$$\bar{y} = mx,$$

where $m$ is the slope to be determined, and $\bar{y}$ denotes a point on the line.
If P represents any point in the table, its vertical deviation from the line will
be $y - \bar{y}$, as shown in the accompanying figure. The problem now is to select
$m$ so that the sum of the squares of these deviations (residuals)
for the N points shall be as small as possible, that is, so that
$\sum (y - \bar{y})^2$ shall be a minimum. Replacing $\bar{y}$ by $mx$, and expanding,
we may write the necessary condition in the form

$$\sum y^2 - 2m\sum xy + m^2\sum x^2 = \text{a minimum},$$

or, on dividing through by N and noting that $\sum xy = N r \sigma_x \sigma_y$,

$$\sigma_y^2 - 2mr\sigma_x\sigma_y + m^2\sigma_x^2 = \text{a minimum}.$$

FIG. 41. Illustrating the derivation of the regression line


Differentiating * this last expression with respect to $m$, and
setting the result equal to zero, gives

$$m = r\,\frac{\sigma_y}{\sigma_x},$$

which is known as the regression coefficient of y on x. The required
equation through the origin thus becomes

$$\bar{y} = \Big(r\,\frac{\sigma_y}{\sigma_x}\Big)x. \tag{36}$$

{Regression line for means of columns, referred to mean of table}


In case the student is not familiar with the differential calculus,
the above result may also be shown in the following way:

Setting
$$S_y^2 = \sigma_y^2 - 2mr\sigma_x\sigma_y + m^2\sigma_x^2,$$

we shall assume $m$ to have the value $r\,\sigma_y/\sigma_x$, and show that any different
value will produce a larger squared sum. It follows that

$$S_y^2 = \sigma_y^2 - 2r^2\sigma_y^2 + r^2\sigma_y^2,$$

or
$$S_y = \sigma_y\sqrt{1 - r^2}. \tag{37}$$

{Standard error of estimate}

Taking $m = r\,\sigma_y/\sigma_x + \delta$, we find that

$$S'^2_y = \sigma_y^2 - 2r^2\sigma_y^2 - 2r\sigma_x\sigma_y\delta + r^2\sigma_y^2 + 2r\sigma_x\sigma_y\delta + \sigma_x^2\delta^2,$$

or
$$S'^2_y = \sigma_y^2(1 - r^2) + \sigma_x^2\delta^2.$$

No matter whether $\delta$ be positive or negative, $S'^2_y$ is greater than
$S_y^2$, and the value of $m$ giving the minimum is therefore $r\,\sigma_y/\sigma_x$.
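The argument without calculus can be tried on numbers. The sketch below uses hypothetical values of r and of the two standard deviations (they come from no table in this book) and compares the squared sum at the slope $r\sigma_y/\sigma_x$ with the squared sum at slopes differing from it by $\delta$.

```python
def squared_residual(m, r, sigma_x, sigma_y):
    """S_y^2 = sigma_y^2 - 2 m r sigma_x sigma_y + m^2 sigma_x^2 for slope m."""
    return sigma_y ** 2 - 2 * m * r * sigma_x * sigma_y + m ** 2 * sigma_x ** 2

# Hypothetical values, chosen only for illustration.
r, sigma_x, sigma_y = 0.6, 2.0, 3.0

m_best = r * sigma_y / sigma_x
s2_min = squared_residual(m_best, r, sigma_x, sigma_y)

# Excess of the squared sum over its minimum for several departures delta.
excess = {}
for delta in (-0.5, -0.1, 0.1, 0.5):
    s2 = squared_residual(m_best + delta, r, sigma_x, sigma_y)
    excess[delta] = s2 - s2_min
    print(delta, round(s2 - s2_min, 6), round(sigma_x ** 2 * delta ** 2, 6))
```

In each line the two printed quantities agree, the excess being $\sigma_x^2\delta^2$ whatever the sign of $\delta$.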
By similar reasoning it may be shown that

$$\bar{x} = \Big(r\,\frac{\sigma_x}{\sigma_y}\Big)y, \tag{38}$$

{Regression line for means of rows, referred to mean of table}


* If the reader is unfamiliar with the calculus he should skip to the following 
paragraph, 
where $r\,\sigma_x/\sigma_y$ is the regression coefficient of x on y. The two
regression lines given by (36) and (38) furnish not only the best
fit to all the points in the table but also to the means of the
columns and rows when the deviations are weighted by the
frequencies of the arrays.* The regression coefficients $r\,\sigma_y/\sigma_x$ and
$r\,\sigma_x/\sigma_y$ give the average change in y and x for a unit change in x
and y, respectively.

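In case the variables are taken from the origins of the measurements,
equations (36) and (38) may be transformed by the
relations $x = X - M_x$ and $y = Y - M_y$, giving

$$Y = r\,\frac{\sigma_y}{\sigma_x}\,X - r\,\frac{\sigma_y}{\sigma_x}\,M_x + M_y \tag{39}$$

and

$$X = r\,\frac{\sigma_x}{\sigma_y}\,Y - r\,\frac{\sigma_x}{\sigma_y}\,M_y + M_x. \tag{40}$$

{Regression lines in score form}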


Using the notation a, b, and c as in the calculation of the correlation
coefficient, and denoting the class intervals on X and Y
by h and k, respectively, two other equations may be obtained.

Since $\sigma_x = \big(\sqrt{b/N}\big)h$, $\sigma_y = \big(\sqrt{c/N}\big)k$, and $r = a/\sqrt{bc}$, it follows that

$$Y = \frac{ak}{bh}\,X - \frac{ak}{bh}\,M_x + M_y \tag{41}$$

and

$$X = \frac{ah}{ck}\,Y - \frac{ah}{ck}\,M_y + M_x. \tag{42}$$

{Regression lines in score form and symbols on correlation sheet}


These last equations are the easiest to calculate, since all the 
necessary quantities may be obtained directly from those given 
in the work for the correlation coefficient. 

For the university and school data in Table 32 we find that

a = 17,468,    $M_x$ = 86.71,
b = 28,838,    $M_y$ = 2.51,
c = 28,249,    h = 1, and k = 1/3.


* Yule, Introduction to Statistics, p. 172. 
Substituting these values in equations (41) and (42), we find
that

$$Y = .2019\,X - 15.00$$
and
$$X = 1.855\,Y + 82.05.$$


These regression lines are plotted in Fig. 40 on page 158. 
Representing the regression coefficients by

$$b_{yx} = r\,\frac{\sigma_y}{\sigma_x} = \frac{ak}{bh} \tag{43}$$

and

$$b_{xy} = r\,\frac{\sigma_x}{\sigma_y} = \frac{ah}{ck}, \tag{44}$$

{Regression coefficients}

it follows at once that $r = \sqrt{b_{yx} \cdot b_{xy}}$. For the above data,
therefore, $r = \sqrt{.3745245} = .6120$. The correlation coefficient
is thus the geometric mean of the two regression coefficients.
Equations (43) and (44) also show that while $b_{yx}$ and $b_{xy}$ are
functions of scale units, their product is a pure number.
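The passage from the quantities on the correlation sheet to the two regression lines may be sketched as below. Several of the constants for Table 32 are partly illegible in this copy; the values of b, $M_x$, $M_y$, h, and k used here are assumptions, chosen because they reproduce the printed equations Y = .2019 X - 15.00 and X = 1.855 Y + 82.05, and should be checked against a clean copy of the table.

```python
import math

# Constants for Table 32. a and c are as printed; b, M_x, M_y, h and k are
# ASSUMED readings that agree with the printed regression equations.
a, b, c = 17468, 28838, 28249
M_x, M_y = 86.71, 2.51
h, k = 1.0, 1.0 / 3.0

# Regression coefficients, equations (43) and (44).
b_yx = a * k / (b * h)
b_xy = a * h / (c * k)

# Constant terms of the score-form lines, equations (41) and (42).
intercept_y = M_y - b_yx * M_x
intercept_x = M_x - b_xy * M_y

# The correlation coefficient as the geometric mean of the two coefficients.
r = math.sqrt(b_yx * b_xy)

print(f"Y = {b_yx:.4f} X {intercept_y:+.2f}")
print(f"X = {b_xy:.3f} Y {intercept_x:+.2f}")
print(f"r = {r:.3f}")
```

Since h and k enter $b_{yx}$ and $b_{xy}$ reciprocally, they cancel in the product, which is why r does not depend on the class intervals.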

Returning to the equation for the means of the columns, it is 
evident that it may prove useful in predicting the most probable 
(average) university grades Y for given high-school grades X. 
Thus a student entering The University of Chicago with a high- 
school average of 95 will probably make a university average of 
.2019 x 95 — 15.00 = 4.18 grade points, or a little better than 
B; while a student entering with a high-school average of 85 
will most likely have a university average of 2.16, or slightly 
better than the required C. 

A measure of the value of these predictions is given by the 
standard deviation of all the observed variations from the re- 
gression line. This quantity, which is known as the standard 
error of estimate, has already been presented in equation (37) 
for the line through the means of the columns. Working out a
similar formula for the rows and multiplying the results by .6745
in order to obtain the probable error * of estimate, we have

* For a complete discussion of probable error, see Chapter XIII. The probable
error is so defined that half of the errors lie within the limits, mean - probable
error and mean + probable error, or M ± P.E.
$$\text{P.E. (est. } Y) = .6745\,\sigma_y\sqrt{1 - r^2} \tag{45}$$

{Probable error of estimate in predicting Y from X}

and

$$\text{P.E. (est. } X) = .6745\,\sigma_x\sqrt{1 - r^2} \tag{46}$$

{Probable error of estimate in predicting X from Y}


Since $\sigma_y = \big(\sqrt{c/N}\big)k = 1.356$, the probable error of estimate
for university grades is $.6745(1.356)\sqrt{1 - (.612)^2} = .723$. Such
calculations are facilitated by the use of logarithms of $\sqrt{1 - r^2}$
given on page 54 of Holzinger’s Tables.

The complete equation for prediction may now be written.

$$Y = .2019\,X - 15.00 \pm .72.$$

The use of this equation may be illustrated in the case of
a student with a high-school average of 95. Substituting
X = 95, we find that Y, or
the most probable university average, is 4.18 grade points.
This result is written 4.18 ± .72, with the interpretation that
it is an even chance that the student’s university average will
be anywhere from 3.46 to 4.90 grade points, or between B- and
A-. This conclusion may be drawn from the fact that half
of the observed deviations from the predicted mean (4.18) lie
between these two values, or, as shown in Fig. 42, half the area
of the curve lies between these limits. (See Chapter XII for
further interpretation of probable error.)

FIG. 42. Illustrating the probable error of estimate
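The prediction and its probable error may be worked mechanically as in the sketch below. The regression equation, the value 1.356 for the standard deviation of university grades, and r = .612 are those given in the text; the function names are illustrative only.

```python
import math

def predict_university_grade(x):
    """Regression of Y on X as printed in the text: Y = .2019 X - 15.00."""
    return 0.2019 * x - 15.00

def probable_error_of_estimate(sigma, r):
    """Equations (45) and (46): .6745 * sigma * sqrt(1 - r^2)."""
    return 0.6745 * sigma * math.sqrt(1.0 - r * r)

y = predict_university_grade(95)
pe = probable_error_of_estimate(1.356, 0.612)
low, high = y - pe, y + pe

print(f"{y:.2f} +/- {pe:.2f}  (even chance between {low:.2f} and {high:.2f})")
```

The limits printed are those within which half of the observed university averages are expected to fall for students entering with an average of 95.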

As might be expected, the value of the prediction becomes 
better as the correlation increases. When r = 1, the standard 
error of estimate is zero and the prediction is perfect in the 
sense that all the observed values lie on a straight line. For a 
list of cautions to be observed in using regression equations, 
see Chapter XV, section 4.




The meaning of the term “regression,” which is due to Sir 
Francis Galton, may be shown in the case of the line for pre- 
dicting the height of sons from the height of their fathers. 
The equation of this line is approximately 


$$S = .5F + 34'' \qquad (M_S = M_F = 68'')$$


where S represents the son’s height and F the father’s height in 
inches. By substituting a few values of F near the mean we find 
the results which are listed in the following table. 


            S       F      S - F
           64      60       + 4
           65      62       + 3
           66      64       + 2
           67      66       + 1
Mean       68      68         0
           69      70       - 1
           70      72       - 2
           71      74       - 3
           72      76       - 4


The column S — F shows regression, or the tendency of the 
son’s predicted height to be nearer the mean than the height of 
his father. Thus a father 74 inches tall will be expected to have 
a son only 71 inches in height, while a father 62 inches tall will 
most probably have a son 65 inches in height, the son’s height 
each time regressing toward the mean of the race. This is one 
of the important laws of inheritance. 
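The table above may be reproduced directly from the approximate line S = .5F + 34 quoted in the text. The short sketch below does so; the function name is illustrative only.

```python
def predicted_son_height(father_height):
    """Approximate regression line quoted in the text: S = .5 F + 34 (inches)."""
    return 0.5 * father_height + 34

# Fathers from 60 to 76 inches in steps of 2, as in the table.
rows = []
for father in range(60, 78, 2):
    son = predicted_son_height(father)
    rows.append((son, father, son - father))
    print(f"{son:5.0f} {father:5d} {son - father:+5.0f}")
```

The last column changes sign at the mean of 68 inches, which is the regression toward the mean that the table is meant to display.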

Galton’s term “regression” continues to be used for any sort
of curve which fits the means of the arrays in a correlation table, 
even though no problem of inheritance is to be considered. 


6. THE INTERPRETATION OF THE CORRELATION COEFFICIENT 


In the case of perfect correlation between two variables the 
association is regarded by some writers as a causal one, and a 
smaller degree of correlation as an approximation to causal 
relationship. It is usually best, however, to avoid this interpretation
and to regard the correlation coefficient merely as a
mathematical expression for the degree of association between 
the traits, regardless of the factors producing the result. This 
may be illustrated by the correlation between height and score 
on an intelligence test with a group of pupils from Grades III 
to VII. The correlation found was .71, but it would be absurd 
to say that the physical growth caused the mental growth, or 
vice versa. The relationship observed was largely due to a third 
factor, age, which when eliminated reduced the amount of the 
correlation to only — .06. 

All statistical data are affected by a multiplicity of factors 
which may obscure the meaning of the relationship found be- 
tween two observed variables. For example, the correlation 
between high-school and university grades found on page 161 
was .612, a result doubtless due in a large measure to the men- 
tality of the student. Many other factors, however, such as his 
age, sex, nationality, health, ambition, methods of study, regu- 
larity of attendance, and personal appearance, as well as the 
type of examinations and reaction of the instructors, doubt- 
less contribute also to the observed correlation. Scholarship as 
measured by marks is thus a variable made up of a large num- 
ber of other variables, and the correlation found is of doubtful 
meaning so far as causes are concerned ; its main value here is 
for predicting university grades from high-school grades regard- 
less of the factors affecting such estimates. 

With standardized tests it is possible to obtain results which 
give a better approximation to the correlation between simple 
variables. The tests themselves measure fairly well certain 
aspects of human abilities such as rate in reading, accuracy in 
arithmetic, and quality in handwriting. Proper methods of 
administering and scoring the tests will eliminate to a large 
extent errors of the observer, while such outstanding factors as 
age, sex, grade, and nationality may be controlled by selection 
of the cases. The correlation between rate and comprehension 
in reading on a certain test for fifty pupils aged twelve and in 
the seventh grade has therefore a good deal more meaning
than the correlation for scholarship quoted above. 

One rather common interpretation of correlation is that it 
shows the percentage of agreement between the associated traits. 
Thus a coefficient of .90 would show 90 per cent agreement, 
while a coefficient of .45 would show 45 per cent agreement.
This interpretation is entirely misleading since the intensity of 
association does not vary directly as the size of the correlation 
coefficient. 

Another custom in dealing with correlation is to classify the 
coefficients as “high,” “medium,” or “low.” Thus .75 would
generally be regarded as “high,” while .25 would be considered
as “low.” This terminology may be convenient in dealing with
test material where the percentage of coefficients above .75 and 
below .25 is small, but may be misleading when dealing with 
other types of data. In an age-grade table, for example, a 
correlation of .75 would be found by comparison with similar 
coefficients to be relatively low. Another misconception some- 
times occurs in interpreting a “high” coefficient, such as .7, as 
meaning almost perfect agreement. How far this is from the 
truth may be seen by mere inspection of the scatter diagram 
for values of this size. 

An interpretation of the correlation coefficient that is of some 
theoretical interest may be illustrated by a problem in dice 
throwing known as Weldon’s experiment.* Twelve dice were 
shaken in a box and thrown again and again, the number of dice 
showing four or more spots on the upper face being recorded. 
When the results of the first, third, fifth, etc. throws were 
paired against the results of the second, fourth, sixth, etc. 
throws, no correlation was found because all the events were 
quite independent of one another. 

* William Brown, Essentials of Mental Measurement, p. 78. Cambridge University
Press, England, 1911.

Next, half of the dice were stained red and after throwing
them all and counting all those showing four or more spots,
the second and every alternate throw thereafter was made by
leaving the red dice upon the table, but counting both colors
when computing the score. The number due to the red dice
was thus common to the two scores. By continuing in this way,
two series of odd and even throws were formed for which the
correlation approached the value 6/12, or .5. In more general
terms, if n is the total number of dice thrown, and c the number
common to the pairs of throws, the expected correlation will be

$$r = \frac{c}{n}.*$$


The correlation coefficient may thus be regarded as the ratio 
of the number of equally effective elements which two variables 
have in common to the total number of independent elements 
constituting each, or, more briefly, as the proportion of common 
elements or causes. It is hardly necessary to add that this in- 
terpretation is little more than suggestive in dealing with ordi- 
nary statistical data where systems of causation are extremely 
complicated. 
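The dice experiment is easily imitated by machine. The sketch below is a simulation under stated assumptions, not a record of Weldon's own throws: fair six-sided dice, a success counted for four or more spots, 20,000 pairs of throws, and a fixed random seed so that the run can be repeated. The function names are illustrative only.

```python
import math
import random

def pearson_r(xs, ys):
    """Product-moment correlation of two equally long lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def weldon_pairs(n_dice, n_common, n_pairs, rng):
    """Simulate paired throws in which n_common dice are left on the table."""
    firsts, seconds = [], []
    for _ in range(n_pairs):
        # Odd throw: every die is thrown; count those showing four or more.
        dice = [rng.randint(1, 6) for _ in range(n_dice)]
        firsts.append(sum(d >= 4 for d in dice))
        # Even throw: the first n_common dice stay as they lie, the rest
        # are thrown again, and all the dice are counted.
        for i in range(n_common, n_dice):
            dice[i] = rng.randint(1, 6)
        seconds.append(sum(d >= 4 for d in dice))
    return firsts, seconds

rng = random.Random(0)
x, y = weldon_pairs(12, 6, 20000, rng)
r_common = pearson_r(x, y)

x0, y0 = weldon_pairs(12, 0, 20000, rng)
r_independent = pearson_r(x0, y0)

print(round(r_common, 2), round(r_independent, 2))
```

With six of the twelve dice common to each pair the coefficient should lie near 6/12, and with none in common near zero, subject to the usual sampling fluctuation.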

A final interpretation of correlation arises from a consideration
of the standard error of estimate, $\sigma_y\sqrt{1 - r^2}$. This quantity,
as already noted, gives the error in prediction by use of a
single score with the regression equation $\bar{y} = r\,\dfrac{\sigma_y}{\sigma_x}\,x$. In case


r = 0, the prediction has a standard error which is equal to the 
standard deviation of the predicted variable and is therefore no 
better than that which would be obtained by selecting a value 
of y at random from the observed distribution. As r becomes 
larger, however, the predictive value of a single score becomes 
better than that afforded by such a chance estimate, the im- 
provement being measured in percentage terms by 


$$I = 100\Big(\frac{\sigma - \sigma\sqrt{1 - r^2}}{\sigma}\Big) = 100\big(1 - \sqrt{1 - r^2}\big). \tag{47}$$

{Improvement over chance in prediction by a single score}


* For proof see William Brown, Essentials of Mental Measurement, p. 79. 
As an example, it may be noted that a correlation coefficient
of .80 gives a value of I = 40, which means that the regression
forecast with a single score is here only 40 per cent better than
a random guess.

Interpretations of this sort may be useful in dealing with 
correlations as a basis for linear prediction, but they are likely 
to give a false impression of the intensity of association given 
by the correlation coefficient itself. Thus a coefficient of .80 
may be rather unsatisfactory for estimates by a single score, 
but the value I = 40 can hardly lead us to regard .80 as a
“poor” correlation with test scores, as one writer suggests.

It should also be noted that the interpretation afforded by 
formula (47) is, like the preceding attempts, quite arbitrary. 
A very different but equally valid percentage measure could 
be easily derived by considering the squared error of estimate 
instead of the first power. 
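Formula (47) and its dependence on an arbitrary choice may be illustrated as follows. The first function is formula (47) itself. The second is only one possible squared-error analogue, namely the percentage reduction in the squared error of estimate, which works out to 100 r²; it is offered here as an assumption about the kind of measure meant, not as a formula given in the text.

```python
import math

def improvement_over_chance(r):
    """Formula (47): 100 * (1 - sqrt(1 - r^2))."""
    return 100.0 * (1.0 - math.sqrt(1.0 - r * r))

def improvement_squared_error(r):
    """ASSUMED squared-error analogue: 100 * (s^2 - s^2 (1 - r^2)) / s^2 = 100 r^2."""
    return 100.0 * r * r

for r in (0.25, 0.50, 0.80, 0.95):
    print(r, round(improvement_over_chance(r), 1), round(improvement_squared_error(r), 1))
```

For r = .80 the two measures give 40 and 64 respectively, which shows how much the "percentage" depends on the power of the error chosen.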

The writer is of the opinion that “layman’s interpretations”
of correlation coefficients should ordinarily be avoided. Such 
attempts, as already pointed out, are usually based on arbitrary 
assumptions and may furnish quite misleading and inconsistent 
results in the hands of the “layman.” The student of statistics 
will soon find that he needs no such devices and his best and 
most useful guide in interpreting correlations will be given by 
a simple scatter diagram with fitted regression curves. Com- 
parisons between correlation coefficients should always be made 
through the use of the sampling errors discussed in Chapter XIII. 


7. SOME USES OF CORRELATION IN EVALUATING 
TEST MATERIAL 


In the preparation of standardized test material several terms 
have come to be accepted with very definite meanings. One of 
these is the reliability of the test, which may be defined as the 
consistency with which the test measures what it purports to 
measure. An index of this consistency is given by the reliability
coefficient, which is merely the correlation between two forms of
the same test given at different times. If $X_1$ and $X_I$ be the two
test forms, this coefficient may be expressed as $r_{1I}$.

Another characteristic of a test is its validity,* by which is 
meant the extent to which it does measure what it purports to 
measure. The evidence in this case must of course be indirect. 
It is customary to select some criterion C, which is known to 
index the trait in question. The correlation $r_{cx}$ between the
criterion and the test therefore furnishes numerical evidence of 
the validity of the latter. 

Suppose, for example, that five tests are proposed as measures 
of intelligence. By correlating the results of each of these with 
the scores on some accepted scale such as the Stanford-Binet, 
a series of validity coefficients of the form $r_{c1} = .78$, $r_{c2} = .82$,
$r_{c3} = .40$, $r_{c4} = .78$, and $r_{c5} = .48$ might be obtained. Tests $X_1$,
$X_2$, and $X_4$ would thus be regarded as considerably more valid
than the other two. In case it is objected that high correlation 
is no sure evidence that the tests are measuring the same thing 
as the criterion, it may be argued that this is the best evidence 
we have and that such correlation shows that the tests have high 
predictive value, which is sufficient justification for their use. 

In case a number of similar tests are pooled the reliability
and validity coefficients of the lengthened test may be obtained
by applying Professor Spearman’s † theorem on the correlation
of sums and differences. These new formulas will be derived
directly, however, making use of standard scores such as $z = x/\sigma$
(see Chapter VII).

Let $z_1 = x_1/\sigma_1$ and $z'_1 = x'_1/\sigma'_1$ be the standard scores on two similar
tests, and let $z_I = x_I/\sigma_I$ and $z'_I = x'_I/\sigma'_I$ be the standard scores on
comparable forms of each. The problem is to determine the
reliability of $z_1 + z'_1$, knowing the reliability of $z_1$. This will be
given by working out

$$r_{(z_1 + z'_1)(z_I + z'_I)} = \frac{\sum z_1 z_I + \sum z_1 z'_I + \sum z'_1 z_I + \sum z'_1 z'_I}{N\,\sigma_{(z_1 + z'_1)}\,\sigma_{(z_I + z'_I)}},$$

which comes from expanding $r = \dfrac{\sum xy}{N\sigma_x\sigma_y}$ when $x = z_1 + z'_1$ and
$y = z_I + z'_I$. The standard deviations $\sigma_{(z_1 + z'_1)}$ and $\sigma_{(z_I + z'_I)}$ reduce
to $\sqrt{2 + 2r_{1I}}$, since $\sigma_{z_1} = \sigma_{z'_1} = \sigma_{z_I} = \sigma_{z'_I} = 1$. All the
correlations between the z’s are equal to $r_{1I}$. We therefore have

$$r_{(z_1 + z'_1)(z_I + z'_I)} = \frac{2\,r_{1I}}{1 + r_{1I}}.$$

* Instead of using the term “validity” some test workers prefer to speak of the
“predictive value” of a test. This is essentially the same property as validity, inasmuch
as both are measured by correlating the test with some criterion.

† C. Spearman, “Correlations of Sums and Differences,” British Journal of
Psychology, Vol. V (1913), p. 417.


By induction it may be easily shown that by increasing a test
n-fold with similar material the cumulative reliability coefficient
is given by

$$r_n = \frac{n\,r_{1I}}{1 + (n - 1)\,r_{1I}}. \tag{48}$$

{Spearman-Brown formula for predicting reliability of lengthened tests}

This expression is often called the Spearman-Brown prophecy
formula, since it was proved independently by both men.

As an example let us assume that a test with a reliability
coefficient of .7 has been prepared. What will the reliability be
when the test has been made three times as long by the addition
of similar material? The answer is found by substituting n = 3
and $r_{1I} = .7$ in equation (48), giving

$$r_n = \frac{3(.7)}{1 + 2(.7)} = .875.$$

An empirical check on this formula was made by Miss Blythe
Clayton* and the writer, with carefully graduated spelling
material. Seven equally difficult tests with parallel forms were
given and the results of actual pooling compared with those
predicted by the formula. The close agreement between observed
and theoretical values is shown in the following table.

* Karl J. Holzinger and Blythe Clayton, “Further Experiments in the Application
of Spearman’s Prophecy Formula,” Journal of Educational Psychology (May,
1925), Vol. XVI, pp. 289-299.


TABLE 34. OBSERVED AND PREDICTED RELIABILITY COEFFICIENTS FOR
SUCCESSIVE POOLS OF n EQUALLY DIFFICULT SPELLING TESTS


NUMBER OF TESTS      OBSERVED RELIABILITY      THEORETICAL COEFFICIENT
POOLED = n           COEFFICIENT               FROM FORMULA (48)

      1                   .743                      .743
      2                   .841                      .853
      3                   .906                      .897
      4                   .916                      .920
      5                   .941                      .936
      6                   .949                      .945
      7                   .955                      .953

The formula for the validity of n pooled tests may be obtained
by working out $r_{c(z_1 + z_2 + z_3 + \cdots + z_n)}$.
For three such tests we shall have

$$r_{c(z_1 + z_2 + z_3)} = \frac{\sum cz_1 + \sum cz_2 + \sum cz_3}{N\,\sigma_c\,\sigma_{(z_1 + z_2 + z_3)}}.$$

Substituting the values for $\sum cz$ and $\sigma_{(z_1 + z_2 + z_3)}$, there results

$$r_{c(z_1 + z_2 + z_3)} = \frac{r_{cz_1} + r_{cz_2} + r_{cz_3}}{\sqrt{3 + 2r_{z_1z_2} + 2r_{z_1z_3} + 2r_{z_2z_3}}}, \tag{49}$$

or

$$r_{c(z_1 + z_2 + z_3)} = \frac{3\,r_{cz}}{\sqrt{3 + 6\,r_{zz}}} \tag{50}$$

if the validity coefficients and correlations $r_{zz}$ are equal. By
induction it may now be shown that for n tests we have

$$r_{c(z_1 + z_2 + \cdots + z_n)} = \frac{n\,r_{cz}}{\sqrt{n + n(n - 1)\,r_{zz}}}. \tag{51}$$

{Formula for predicting validity of lengthened tests}


A test with a reliability of .7 and a validity coefficient of
.6 would, upon being made three times as long, have a validity
coefficient of

$$\frac{3 \times .6}{\sqrt{3 + 6 \times .7}} = .67.$$
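The validity of a pooled or lengthened test may be computed as sketched below. The first function extends formula (49) from three tests to any number by taking the sum of the validities over the square root of n plus twice the sum of the intercorrelations of all pairs; this extension is stated here as an assumption, although it follows from the same expansion. The second function is formula (51). The function names are illustrative only.

```python
import math

def pooled_validity(r_cz, r_zz_pairs):
    """Formula (49), extended to n tests (an assumed generalisation).

    r_cz       -- validity coefficient of each test
    r_zz_pairs -- intercorrelation of each distinct pair of tests
    """
    n = len(r_cz)
    return sum(r_cz) / math.sqrt(n + 2.0 * sum(r_zz_pairs))

def lengthened_validity(r_cz, r_zz, n):
    """Formula (51): equal validities r_cz and equal intercorrelations r_zz."""
    return n * r_cz / math.sqrt(n + n * (n - 1) * r_zz)

# Example of the text: reliability .7, validity .6, test tripled in length.
tripled = lengthened_validity(0.6, 0.7, 3)
same_by_49 = pooled_validity([0.6, 0.6, 0.6], [0.7, 0.7, 0.7])

# Hypothetical comparison: equal validities, low versus high intercorrelations.
low_overlap = pooled_validity([0.6, 0.6, 0.6], [0.2, 0.2, 0.2])
high_overlap = pooled_validity([0.6, 0.6, 0.6], [0.8, 0.8, 0.8])

print(round(tripled, 3), round(low_overlap, 3), round(high_overlap, 3))
```

The hypothetical comparison shows the effect of the denominator: with the validities held equal, the pool of tests that correlate low among themselves has the higher validity.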
An interesting agreement between psychological theory and
statistical analysis may be noted in connection with formula
(49). Suppose several tests are to be pooled for the measurement
of intelligence. Some psychologists would select tests
which correlate high with a criterion and low amongst themselves
so as to obtain not only those which are most valid but
also those which measure as wide a sampling of intellectual
abilities as possible. Psychological theory would then require
a pool of tests with high values $r_{cz_1}$, $r_{cz_2}$, $r_{cz_3}$, etc., and low values
$r_{z_1z_2}$ etc. Very fortunately, such a combination will produce a
high validity coefficient for the combined tests, as may be seen
from formula (49). High coefficients, $r_{cz}$, will give a large numerator,
and low coefficients, $r_{zz}$, will give a small denominator,
both acting in the same direction to produce a high validity
coefficient for the pooled tests.

Another interesting application of correlation occurs in connection
with the scoring of multiple-choice tests. The general
formula advocated to correct for the element of guessing is

$$S = R - \frac{1}{n - 1}\,W = R - CW, \tag{52}$$

{Multiple-response scoring formula}

where S is the score, R the number of right responses, W the
number of wrong responses, C a constant, and n the number of
choices. Thus if the examinee is to underline one of three suggested
answers, he would be scored by the formula $S = R - \tfrac{1}{2}W$.

Such complicated scoring methods may be avoided entirely
if all pupils be allowed to finish the test. In this case, if A = the
number of attempts, R + W = A = a constant. We may also
write S = R + C(R - A), or S = aR + b, where a and b are constants.
The correlation between S and R will now be perfect,
so that the number of “rights” furnishes as reliable and valid a
score as the full formula. The proof that $r_{SR} = +1.00$ is left as
an exercise for the student. (See Exercise 8.)
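The statement that the corrected score and the number of rights correlate perfectly when every pupil finishes the test may be verified numerically. In the sketch below the scoring formula is taken in the form S = R - W/(n - 1), and the six pupils and the test length of 40 items are hypothetical; the function names are illustrative only.

```python
import math

def corrected_score(right, wrong, n_choices):
    """Scoring formula (52) in the form S = R - W / (n - 1)."""
    return right - wrong / (n_choices - 1)

def pearson_r(xs, ys):
    """Product-moment correlation of two equally long lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical pupils, all of whom attempt every one of the 40 items,
# so that R + W = A = 40 for each pupil.
attempts = 40
rights = [12, 25, 31, 18, 40, 22]
scores = [corrected_score(r, attempts - r, 3) for r in rights]

r_SR = pearson_r(scores, rights)
print(scores, round(r_SR, 6))
```

Because W = A - R with A constant, each score is a straight-line function of R, and the coefficient is unity; were the pupils to attempt different numbers of items, this would no longer hold.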




8. THE EFFECT OF SELECTION UPON CORRELATION


If the correlation between two traits, $X_1$ and $X_2$, is given by
$r_{12}$ with a sample of N and selected values are chosen, reducing
the size of the sample to N - n, then the resulting correlation
$R_{12}$ will differ from that obtained for the unselected group.

Professor Pearson * has shown that if $\sigma_1$ denotes the variability
in $X_1$ before selection and $\Sigma_1$ the variability after selection,
then

$$R_{12} = \frac{r_{12}\,\dfrac{\Sigma_1}{\sigma_1}}{\sqrt{1 - r_{12}^2 + r_{12}^2\,\dfrac{\Sigma_1^2}{\sigma_1^2}}}. \tag{53}$$

{Correlation after selection}

The correlation $R_{12}$ decreases with $\Sigma_1$, so that restricting the
range of $X_1$ lowers the original correlation.

As an example, let us assume that the correlation between two
traits is given by $r_{12} = .7$, and that values of $X_1$ are taken so
that $\sigma_1 = 10$ is reduced to $\Sigma_1 = 5$. Substituting these values
in equation (53), we find $R_{12} = .44$, which is considerably less
than the correlation before selection.
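The example may be checked, and the effect of other degrees of restriction examined, with the short sketch below. It takes formula (53) in the form in which the ratio of the standard deviation after selection to that before selection multiplies $r_{12}$ in the numerator and its square multiplies $r_{12}^2$ under the radical; the function and argument names are illustrative only.

```python
import math

def correlation_after_selection(r12, sd_before, sd_after):
    """Formula (53): correlation after selection on the first variable."""
    ratio = sd_after / sd_before
    return r12 * ratio / math.sqrt(1.0 - r12 ** 2 + r12 ** 2 * ratio ** 2)

# Example of the text: r = .7, standard deviation reduced from 10 to 5.
R12 = correlation_after_selection(0.7, 10, 5)

# The coefficient falls steadily as the range is narrowed.
for sd_after in (10, 8, 6, 4, 2):
    print(sd_after, round(correlation_after_selection(0.7, 10, sd_after), 3))
```

With no restriction the formula returns the original coefficient, and the coefficient approaches zero as the selected group becomes more and more alike in $X_1$.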

In case there is selection in both variables the adjustment
formulas † become very complicated. The beginning student
will do well to avoid problems involving such correction until
he is in a position to read the papers cited in footnotes below.


9. THE EFFECT OF RANGE OF TALENT UPON CORRELATION 


* Karl Pearson, “On the Influence of Natural Selection on the Variability and
Correlation of Organs,” Philosophical Transactions of the Royal Society of London,
Series A., Vol. CC, p. 28.

† Karl Pearson, “On the Influence of Double Selection on Variation and Correlation
of Two Characters,” Biometrika, Vol. VI (1908).

The magnitude of correlation coefficients clearly depends
upon the particular group studied. Thus, “to secure a reliability
coefficient of .40 from a group composed of children in a
single grade is probably indicative of greater, not less, reliability
than to secure a reliability coefficient of .90 from a group composed
of children from the second to the twelfth grades,” as
shown by Professor Kelley.*

This difference in the value of the obtained correlation is due 
to what Kelley calls “range of talent,” and he has given a formula 
(see equation (116)) for adjusting coefficients for varying ranges of 
talent. The proof of the formula, however, is open to some ob- 
jections and it is probably better, therefore, to compare correla- 
tion coefficients only when they have been obtained from the same 
group or from groups varying but slightly in range of talent. 

As a general caution it may be noted that it is not safe to 
compare correlation coefficients of any sort obtained from 
groups where the range of talent or other conditioning factors 
such as range in age are very different (see Chapter XV). 


EXERCISES 


1. Make correlation tables for Otis with Terman and for Chicago
with Terman tests from the data of Exercise 1, Chapter II. Use
intervals of 69.5-79.5 etc. for Otis and Terman, and 29.75-34.75 etc. for
Chicago. Work out the coefficients of correlation.

($r_{OT}$ = .718; $r_{CT}$ = .681. Ans.)


2. Make a correlation table for the two spelling tests of Exercise 6, 
Chapter II, using intervals of 5 units for both tests. Work out the 
correlation coefficient. (r = .963. Ans.)


3. Compute the means of the columns and the means of the rows 
from the table of Exercise 2, and plot them on graph paper. Calcu- 
late the equations of the two regression lines and plot on the same 
graph. Determine the two probable errors of estimate. 

(A = 1.01B − 2.28 ± 4.04; B = .92A + 6.63 ± 3.85. Ans.)


4. Calculate the correlation coefficient, regression lines, and prob- 
able errors of estimate for the table on page 174. Compute the 
means of the columns and rows and plot with the regression lines as 
in Exercise 3. The values of the constants are given below the table. 


5. Compute the correlation coefficient for the table on page 175. 


* T. L. Kelley, "The Reliability of Test Scores," Journal of Educational
Research, May, 1921, p. 374.

6. Work out the correlation coefficient for the first 25 pairs of
scores for the data of Exercise 2, using the method described in sec- 
tion 8. Compare the amount of arithmetic with that involved in the 
use of the correlation table. 

7. The experimental reliability coefficients found by lengthening a
spelling test from one to ten times the original value were: .850, .903,
.927, .946, .960, .970, .974, .976, .980, and .981. Calculate the
corresponding theoretical coefficients from the Spearman-Brown formula,
using $r_{11}$ = .847 and n = 1, 2, 3, ..., 10 successively. (Data furnished by
Professor G. M. Ruch.)

(.847, .917, .943, .957, .965, .971, .975, .978, .980, .982. Ans.)
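
The answers to Exercise 7 may be verified by machine. The sketch below assumes the usual form of the Spearman-Brown formula, $r_{nn} = n r_{11} / [1 + (n - 1) r_{11}]$; the function name is ours.

```python
def spearman_brown(r11, n):
    """Reliability of a test lengthened n times, given the reliability r11
    of the original test (usual form of the Spearman-Brown formula)."""
    return n * r11 / (1 + (n - 1) * r11)

# Theoretical coefficients for n = 1, 2, ..., 10 with r11 = .847.
theoretical = [round(spearman_brown(0.847, n), 3) for n in range(1, 11)]
print(theoretical)
```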

8. Work out the proof for the exercise suggested at the end of sec- 
tion 7. 

9. Prove that the correlation between aX + b and cY + d is the
same as the correlation between X and Y where a, b, c, and d are
constants.


CHAPTER X
NON-LINEAR CORRELATION 
1. THE CORRELATION RATIO 


As pointed out in Chapter IX, when the means of the arrays
do not lie fairly closely on a straight line, the regression is to
be regarded as non-linear. The correlation coefficient, which
measures the approach to functionality only when the traits
have a linear relationship, will give an understatement of the
degree of association present for such curvilinear trends and is
therefore inapplicable. An extreme case of this understatement
is illustrated in Fig. 43, where all the observations lie on a
half circle. The correlation as defined by approach to
functionality will be perfect, but the product-moment coefficient
will give zero as the amount of association. This may be readily
verified by noting that from the symmetry of the points, $\Sigma xy$
will equal zero.

Fig. 43. An extreme case of non-linear correlation
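
The extreme case of the half circle is easily tried out numerically. The sketch below places a set of points on the upper half of a unit circle (the particular points are our own choice) and computes the product-moment coefficient directly from its definition.

```python
import math

def pearson_r(xs, ys):
    """Product-moment correlation coefficient from raw paired values."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Points lying exactly on a half circle, symmetric about x = 0.
xs = [i / 10 for i in range(-10, 11)]
ys = [math.sqrt(1 - x * x) for x in xs]

r = pearson_r(xs, ys)
print(abs(r) < 1e-9)  # r vanishes apart from rounding error
```

Although y is completely determined by x, the coefficient is zero, which is the understatement described above.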

In order to measure the correlation for non-linear tables,
Professor Pearson has devised a coefficient known as the
correlation ratio. The meaning of this coefficient may be shown by
returning to formula (37) for the standard error of estimate
of y on x. Rearranging the terms in this formula, we have

$$r = \sqrt{1 - \frac{S_y^2}{\sigma_y^2}}$$ {Correlation coefficient in ratio form} (54)

where $S_y$ is the standard deviation of the differences $y - \bar{y}$, or
residuals from estimation by the regression line $\bar{y} = mx$. The
coefficient of correlation was derived on the assumption that
the means of the columns, $\bar{y}_x$, and the means of the rows, $\bar{x}_y$, lie
on their respective regression lines.

In case the means of the columns, $\bar{y}_x$, do not lie on a straight
line, the residuals $y - \bar{y}$ may be replaced by the differences
$y - \bar{y}_x$, whose standard deviation is denoted by $\sigma_{ay}$. The
correlation ratio for the means of the columns may then be defined as

$$\eta_{yx} = \sqrt{1 - \frac{\sigma_{ay}^2}{\sigma_y^2}}$$ (55)

and, for the means of the rows, as

$$\eta_{xy} = \sqrt{1 - \frac{\sigma_{ax}^2}{\sigma_x^2}}$$ (56)

{Correlation ratios, original form}

From Fig. 44 it is apparent that the differences $y - \bar{y}_x$ and
their standard deviation $\sigma_{ay}$ measure the extent to which the
points in the scatter diagram are concentrated about the
irregular regression curve. When all the points in the diagram
are located at the column means, the differences $y - \bar{y}_x$ and
$\sigma_{ay}$ will be zero, giving $\eta_{yx} = 1$; but when there is any scatter
in the arrays, $\sigma_{ay}$ will not be zero, and $\eta_{yx}$ will be less than 1.
The correlation ratio thus measures the approach of the data to any
single-valued function, while the correlation coefficient indicates
the closeness to linear functionality.

Fig. 44. Illustrating the correlation ratio

It is further evident that if the regression is linear,
$y - \bar{y} = y - \bar{y}_x$ for all the columns, so that $S_y = \sigma_{ay}$, and
$r = \eta_{yx}$. If the regression is not linear, $\bar{y}_x = \bar{y} + d$, or
$y - \bar{y}_x + d = y - \bar{y}$. Squaring both members of this last
expression, summing over the whole table, and introducing $f_x$ as a
symbol of operation, we have

$$\Sigma (y - \bar{y}_x)^2 + 2\,\Sigma\, d\,(y - \bar{y}_x) + \Sigma f_x d^2 = \Sigma (y - \bar{y})^2.$$

Since the middle term on the left is zero for each column, this
reduces to

$$\sigma_{ay}^2 + \sigma_d^2 = S_y^2.$$ (57)

By combining equations (54), (55), and (57), we finally obtain

$$\eta_{yx}^2 - r^2 = \frac{\sigma_d^2}{\sigma_y^2}.$$ (58)

This proves that $\eta_{yx}$ is always greater than or equal to r, since
$\sigma_d^2$ cannot be negative. The same reasoning might of course
be applied to $\eta_{xy}$.

From the above discussion it is apparent that the single 
measure of association furnished by the correlation coefficient 
may be replaced by the two correlation ratios which are always 
numerically greater than r. The correlation coefficient fails to 
measure the full amount of association in case the regression is 
not linear, and should not be used unless the departure from 
linearity is negligible (see section 4). 
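
The relation between the ratio and the coefficient may be seen in a small numerical case. The sketch below computes $\eta_{yx}$ from definition (55) by grouping the observations into columns on x, and r from its definition; the ten pairs of values are invented purely for illustration.

```python
import math
from collections import defaultdict

def pearson_r(xs, ys):
    """Product-moment correlation coefficient from raw paired values."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def correlation_ratio(xs, ys):
    """eta_yx = sqrt(1 - sigma_ay^2 / sigma_y^2), as in equation (55)."""
    n = len(ys)
    mean_y = sum(ys) / n
    var_y = sum((y - mean_y) ** 2 for y in ys) / n
    columns = defaultdict(list)          # one array of y values per x value
    for x, y in zip(xs, ys):
        columns[x].append(y)
    within = 0.0                         # squared deviations from column means
    for col in columns.values():
        col_mean = sum(col) / len(col)
        within += sum((y - col_mean) ** 2 for y in col)
    return math.sqrt(1 - (within / n) / var_y)

# Invented data whose column means rise and then fall.
xs = [1, 1, 2, 2, 3, 3, 4, 4, 5, 5]
ys = [1, 3, 5, 7, 8, 10, 6, 8, 2, 4]

eta = correlation_ratio(xs, ys)
r = pearson_r(xs, ys)
print(round(eta, 3), round(r, 3))
```

For these values the ratio is high while the coefficient is small, the difference being due entirely to the curvature of the column means.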


2. MODIFIED FORMULAS FOR THE CORRELATION RATIOS 


Formulas which are more convenient for computation may be
obtained by modifying equations (55) and (56) and introducing
the methods and notation of Chapter IX. The quantity $y - \bar{y}_x$,
when squared and summed over a column, gives

$$\Sigma'(y - \bar{y}_x)^2 = \Sigma' y^2 - 2\,\Sigma' y\,\bar{y}_x + \Sigma' \bar{y}_x^2 = \Sigma' y^2 - f_x \bar{y}_x^2,$$

where the primes denote summation over a column.
Summing next over the whole table, we find

$$\Sigma\Sigma'(y - \bar{y}_x)^2 = \Sigma\Sigma' y^2 - \Sigma f_x \bar{y}_x^2.$$

Denoting the standard deviation of $\bar{y}_x$ by $\sigma_{\bar{y}_x}$, we then have

$$\sigma_{ay}^2 = \sigma_y^2 - \sigma_{\bar{y}_x}^2.$$ (59)

If this result is substituted in equation (55), we obtain

$$\eta_{yx} = \frac{\sigma_{\bar{y}_x}}{\sigma_y}$$ (60)

and, similarly,

$$\eta_{xy} = \frac{\sigma_{\bar{x}_y}}{\sigma_x}.$$ (61)

{Correlation ratios as quotients of two standard deviations}

The correlation ratio is thus the quotient of the standard
deviation of the means of the arrays divided by the standard
deviation of the whole table. It should be noted that in forming
$\sigma_{\bar{y}_x}$, the deviations are weighted by the frequencies of the arrays.

We shall next modify formulas (60) and (61) so that the
calculations may be carried out with the variables taken from
arbitrary origins as in the formulas of Chapter IX.

The first formula may be written

$$\eta_{yx} = \frac{\sqrt{\dfrac{\Sigma f_x (M_y - Y_x)^2}{N}}}{\sigma_y}$$ {Correlation ratio for means of columns} (62)

where

$$M_y = A_y + \left(\frac{\Sigma f_y d_y}{N}\right) k \quad \text{and} \quad Y_x = A_y + \left(\frac{\Sigma' f_{xy} d_y}{f_x}\right) k.$$

We therefore have

$$\frac{M_y - Y_x}{k} = \frac{\Sigma f_y d_y}{N} - \frac{\Sigma' f_{xy} d_y}{f_x}.$$

Summing over the columns and then over the whole table,
we find

$$\frac{\Sigma f_x (M_y - Y_x)^2}{k^2} = (\Sigma f_x)\frac{(\Sigma f_y d_y)^2}{N^2} - 2\,\frac{\Sigma f_y d_y}{N}\,\Sigma\left[\Sigma' f_{xy} d_y\right] + \Sigma\left[\frac{(\Sigma' f_{xy} d_y)^2}{f_x}\right],$$

since $\Sigma\left[\Sigma' f_{xy} d_y\right] = \Sigma f_y d_y$.
Upon collecting terms we may write

$$\frac{\Sigma f_x (M_y - Y_x)^2}{k^2} = \Sigma\left[\frac{(\Sigma' f_{xy} d_y)^2}{f_x}\right] - \frac{(\Sigma f_y d_y)^2}{N} = e$$

and, similarly,

$$\frac{\Sigma f_y (M_x - X_y)^2}{h^2} = \Sigma\left[\frac{(\Sigma' f_{xy} d_x)^2}{f_y}\right] - \frac{(\Sigma f_x d_x)^2}{N} = d.$$

It has already been shown in Chapter IX that

$$\sigma_x = h\sqrt{\frac{b}{N}} \quad \text{and} \quad \sigma_y = k\sqrt{\frac{c}{N}}.$$

Substituting these results in formula (62), we have

$$\eta_{yx} = \sqrt{\frac{e}{c}}$$ (63)

and, similarly,

$$\eta_{xy} = \sqrt{\frac{d}{b}}.$$ (64)

{Correlation ratios for correlation blank}

These last formulas are especially convenient when the correlation
coefficient and ratios are to be compared as is often necessary.
The quantities b and c and the corrections for d and e are
obtained in working out the correlation coefficient. The
remainder of the computation for d and e may be readily done
with the aid of the special correlation sheet shown in Table 35.
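
The agreement between formula (63) and the definition (60) may be illustrated on a small table. The sketch below is not Table 35; the frequencies are hypothetical, and it is assumed that the quantity c of Chapter IX is $\Sigma f_y d_y^2 - (\Sigma f_y d_y)^2/N$, as required by $\sigma_y = k\sqrt{c/N}$.

```python
import math

# Hypothetical correlation table: table[i][j] is the frequency in
# column i (a class of X) and row j (a class of Y).
table = [
    [3, 2, 0, 0],
    [1, 4, 2, 0],
    [0, 2, 5, 1],
    [0, 1, 3, 4],
]
d_y = [-1, 0, 1, 2]  # step deviations of the Y classes from an arbitrary origin

N = sum(sum(col) for col in table)
f_x = [sum(col) for col in table]                                # column totals
f_y = [sum(col[j] for col in table) for j in range(len(d_y))]    # row totals

sum_fd = sum(f * d for f, d in zip(f_y, d_y))
correction = sum_fd ** 2 / N

# c (assumed Chapter IX quantity) and e as defined above formula (63).
c = sum(f * d * d for f, d in zip(f_y, d_y)) - correction
col_sums = [sum(f * d for f, d in zip(col, d_y)) for col in table]
e = sum(s * s / f for s, f in zip(col_sums, f_x)) - correction
eta_yx = math.sqrt(e / c)

# Direct check from definition (60), in class-interval units.
mean_y = sum_fd / N
col_means = [s / f for s, f in zip(col_sums, f_x)]
var_of_means = sum(f * (m - mean_y) ** 2 for f, m in zip(f_x, col_means)) / N
var_y = sum(f * (d - mean_y) ** 2 for f, d in zip(f_y, d_y)) / N
eta_direct = math.sqrt(var_of_means / var_y)

print(round(eta_yx, 4), round(eta_direct, 4))
```

The class interval k cancels from numerator and denominator, which is why the whole computation may be carried through in step deviations.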


3. A COMBINATION FORM FOR THE CORRELATION COEFFICIENT
AND RATIOS


In Table 35 the full computation of the correlation ratios is 
shown. It will be noted that this form differs from the one 
shown in Table 31 in that two additional columns and rows for 
the calculation of the ratios are given. 

For the items on the row headed $(\Sigma' f_{xy} d_y)^2$ it is only
necessary to square, by means of tables, the quantities $(\Sigma' f_{xy} d_y)$
already found in calculating r. In the last row these squared
items are divided by the corresponding frequencies $f_x$, the total
sum being $\Sigma\left[(\Sigma' f_{xy} d_y)^2 / f_x\right]$. The correction to this last
quantity is $(\Sigma f_y d_y)^2 / N$, also known from previous work. Making the
correction, we obtain e = 10,827, and by a similar calculation for
the rows we determine d = 11,817. The remainder of the
computation may be readily done with the aid of logarithms as
illustrated on the sheet.

It will be noted that the computation for $\Sigma f_{xy} d_x d_y$ has been
checked by working out this quantity from both the columns
and rows. Other important checks, such as $\Sigma[\Sigma' f_{xy} d_y] = \Sigma f_y d_y$,
should be noted on the sheet and carefully observed in the
calculations.

The value for $\eta_{yx}$ comes out as .6191, while $\eta_{xy}$ is .6401. The
former ratio is in close agreement with the correlation
coefficient, r = .6120, on account of the linearity of the means of the
columns. The coefficient $\eta_{xy}$, however, is somewhat larger than
r because of the irregular regression curve for the rows.

In case the means of the arrays are required for plotting they
may be readily found by use of the formulas

$$X_y = A_x + \left(\frac{\Sigma' f_{xy} d_x}{f_y}\right) h$$ (65)

and

$$Y_x = A_y + \left(\frac{\Sigma' f_{xy} d_y}{f_x}\right) k$$ (66)

{Means of the arrays for correlation table}

where $A_x$ and $A_y$ are the assumed means and h and k the class
intervals for the variables X and Y, respectively. The
quantities $\Sigma' f_{xy} d_x$ and $\Sigma' f_{xy} d_y$ may be taken directly from the
correlation sheet. For the means of the columns, we should thus have


4. TESTS FOR LINEARITY


In order to determine whether or not a correlation table is
sufficiently linear so that r may replace $\eta$ as a measure of
association, a test for linearity known as Blakeman's test may be
applied. The observed difference between $\eta_{yx}$ and r for the
data on page 182 is .6191 − .6120 = .0071. With a similar table
it appears quite likely that this difference might be zero.

The tests of Blakeman which we shall use here may be
expressed in the following form: The difference between $\eta$ and r is
to be regarded as insignificant, provided that

$$\eta^2 - r^2 < \frac{4.047}{\sqrt{N}}\sqrt{(\eta^2 - r^2)\left\{(1 - \eta^2)^2 - (1 - r^2)^2 + 1\right\}}$$ (67)

{Blakeman's test for linearity}
or, if $\eta^2 - r^2$ is small in comparison with r,

$$\sqrt{N}\,\sqrt{\eta^2 - r^2} < 4.047.$$ (68)

{Blakeman's short test for linearity}

A full discussion of such sampling tests will be found in Chapter
XIII, but for the present we shall merely illustrate the above
rules by applying them to the university and high-school
correlations found in the preceding section.

For the coefficients $\eta_{yx}$ and r we have, upon substituting the
necessary values in formula (67),

$$.00874 < \frac{4.047}{41.32}\sqrt{.00874\{.989\}} = .00911.$$

Using formula (68), we have

$$41.32 \times .0935 = 3.86 < 4.047.$$

By both tests, therefore, the regression is to be regarded as
linear and the use of a linear equation for predicting university
from high-school grades is justified.

Applying formula (68) to $\eta_{xy}$ and r, we find

$$41.32 \times .1876 = 7.75 > 4.047.$$

The regression in this case is non-linear, and for a full measure
of the association, $\eta_{xy}$ must be used.

For a small number of cases (say 50) it is frequently impossible
to determine with certainty whether or not the regression
is linear. With small tables, therefore, unless the regression is 
obviously curved, the calculation of r will be all that is required. 
With considerably larger bodies of data, however, the test be- 
comes important and the use of r should be justified by com- 
parison with both of the correlation ratios. 
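
The two forms of Blakeman's test may be set down as a short routine. The sketch below reproduces the figures of the worked example; the function names are ours, and N is recovered from the value $\sqrt{N} = 41.32$ used in the text, which is an assumption about the size of the table.

```python
import math

def blakeman_full(eta, r, n):
    """Both sides of test (67): returns (eta^2 - r^2, limiting value)."""
    zeta = eta ** 2 - r ** 2
    bracket = (1 - eta ** 2) ** 2 - (1 - r ** 2) ** 2 + 1
    limit = 4.047 / math.sqrt(n) * math.sqrt(zeta * bracket)
    return zeta, limit

def blakeman_short(eta, r, n):
    """Left-hand side of the short test (68); linear if below 4.047."""
    return math.sqrt(n) * math.sqrt(eta ** 2 - r ** 2)

n = 41.32 ** 2  # assumed from sqrt(N) = 41.32 in the worked example

zeta, limit = blakeman_full(0.6191, 0.6120, n)
stat_columns = blakeman_short(0.6191, 0.6120, n)   # eta_yx against r
stat_rows = blakeman_short(0.6401, 0.6120, n)      # eta_xy against r

print(round(zeta, 5), round(limit, 5))
print(round(stat_columns, 2), stat_columns < 4.047)
print(round(stat_rows, 2), stat_rows < 4.047)
```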


5. A METHOD OF ELIMINATING THE EFFECT OF A VARIABLE 
UPON THE ASSOCIATION BETWEEN Two OTHERS 


If three or more correlated variables are involved, the
association between two of them for a fixed value of the third is
often required. In case the regressions are all linear throughout,
the problem may be solved by the use of multiple correlation
as shown in Chapter XV, but with non-linear relationships the
solution becomes more difficult.

Fig. 45. Illustrating formula (69)

The most direct and the best method for dealing with such 
problems is to correct the two associated variables for values of 
the third. The method may be illustrated for two variables, both 
having non-linear correlation with age. Designating these as 
X, Y, and A, the correlation $r_{xy}$ for A eliminated is required.

It is first necessary to prepare the correlation tables for X
with A and Y with A and determine the regression curves for
Y on A and X on A as shown in Fig. 45. These curves may be
drawn in free-hand, or fitted by the method of least squares as
shown in Chapter XVI. A certain age, $A_s$, is then selected for both
tables, and all the values of X and of Y are corrected to this age.

Let the ordinate of any observation in the table at age $A_t$
be denoted by $Y_t$, and let $\bar{Y}_t$ and $\bar{Y}_s$ be the mean values of Y
at $A_t$ and $A_s$ furnished by the regression curve. Then the
required value of $Y_s$ at $A_s$ will be given by the relation

$$Y_s = \bar{Y}_s + (Y_t - \bar{Y}_t).$$ {Corrective formula for eliminating age} (69)

From Fig. 45 it will be noted that this formula merely
assumes that the growth in $Y_t$ from $A_t$ to $A_s$ is parallel to the
regression curve between these points. The corrected
variable $Y_s$ is thus the most likely value that $Y_t$ will
have when the individual has reached the standard age.

The arithmetic is most easily done by preparing a
table of values $\bar{Y}_t$ for all ages and then applying
formula (69) to the observations at each age across the
correlation table. Similar corrections
may be made for the variable X, and all the results recorded
on the tabulation cards. The correlation between the corrected
values $Y_s$ and $X_s$ then gives a good approximation to the
result that would have been obtained if all the subjects had
been measured at the same age, $A_s$.

Fig. 46. Illustrating formula (70)

In case the standard deviations of the arrays of ages are not
equal, another correction may be made. Equal variability of the
arrays across the table is described as homoscedasticity, and
unequal variability as heteroscedasticity. The new correction,
then, is for heteroscedasticity as illustrated in Fig. 46.
If an individual at $A_t$ is one standard deviation above the
mean $\bar{Y}_t$, his most probable deviation at age $A_s$ will be one
standard deviation above the mean at standard age. Denoting the
standard deviations of the arrays at $A_t$ and $A_s$ by $\sigma_t$ and $\sigma_s$,
respectively, the corrective formula * then becomes

$$Y_s = \bar{Y}_s + (Y_t - \bar{Y}_t)\,\frac{\sigma_s}{\sigma_t}.$$ {Corrective formula adjusting for age and heteroscedasticity} (70)

In applying this formula it is necessary to work out the ratios
$\sigma_s / \sigma_t$ for each age, multiply the result by the corrective factor
$(Y_t - \bar{Y}_t)$, and add to the value for $\bar{Y}_s$.

The corrected values $X_s$ and $Y_s$ may now be correlated, and
the result will give the relationship between these variables for
the age eliminated. This is essentially what is known as a partial
correlation between X and Y (for A fixed). In case the variables
X and Y have linear regression with A, a partial correlation may
be worked out by the use of a formula (see Chapter XV).
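
The two corrective formulas may be sketched as a single routine. The function name and the figures used below are hypothetical; the means are supposed to have been read from the regression curve and the standard deviations from the arrays.

```python
def correct_to_standard_age(y_t, mean_t, mean_s, sigma_t=None, sigma_s=None):
    """Correct an observation y_t made at age A_t to the standard age A_s.

    mean_t and mean_s are the regression-curve means at A_t and A_s.
    Without standard deviations formula (69) is applied; with them,
    formula (70), which also adjusts for heteroscedasticity.
    """
    if sigma_t is None or sigma_s is None:
        return mean_s + (y_t - mean_t)                    # formula (69)
    return mean_s + (y_t - mean_t) * sigma_s / sigma_t    # formula (70)

# Hypothetical case: the curve gives 50 at age 10 and 60 at standard age 12;
# the arrays have standard deviations 4 and 6 at the two ages.
by_69 = correct_to_standard_age(54, 50, 60)
by_70 = correct_to_standard_age(54, 50, 60, sigma_t=4, sigma_s=6)
print(by_69, by_70)
```

Under formula (70) the child who stood one standard deviation above the mean at age 10 is placed one standard deviation above the mean at the standard age, which is the equality of standard scores referred to in Exercise 4.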

It may finally be noted that the regression curves for the
corrected variables $X_s$ and $Y_s$ may be non-linear and the correlation
ratio required. Whatever measure of relationship is used,
however, the resulting association is freed from the effect of the
third variable A.


EXERCISES 


1. Work out the correlation coefficient and the two correlation
ratios for the table on page 187. Apply the tests for linearity.

(r = −.828; $\eta_{xy}$ = .961; $\eta_{yx}$ = .958. Ans.)

2 and 3. Calculate the correlation coefficient and ratios for the
tables on pages 188 and 189, and test for linearity.


4. Show that the method for correction given by formula (70) is
equivalent to equating standard scores at ages $A_t$ and $A_s$.


* For an illustration of the use of this formula see a paper by the author, "On the
Relation of Vital Capacity to Certain Psychical Characters," Biometrika, Vol. XVI,
p. 139.

CHAPTER XI
THE BINOMIAL DISTRIBUTION 
1. INTRODUCTORY 


As pointed out in the first chapter, the inductive side of
statistical method is based on the theory of probability. The
comparison of results from different samples, inferences regarding
differences, and generalizations of various sorts are possible
only by resorting to the theory of chance. 

So important is this aspect of statistical science that some 
writers * devote practically all of their treatment to the theory 
of probability. For an elementary course and for the non- 
mathematical student such extensive treatment is impossible. 
We shall therefore be content to present here some of the 
simplest ideas in this theory with the understanding that the 
student is urged to amplify his knowledge of probability by 
consulting such works as Keynes,† Whittaker,‡ and Fisher.

In the present chapter we shall take up certain elementary 
theorems in probability and discuss the chance distribution 
known as the point binomial. Certain properties of this series 
which are important in the theory of sampling will also be con- 
sidered. The binomial law also serves as a good introduction 
for the normal probability curve, which will be taken up in the 
following chapter. 

In order to remind the student of some of the algebra useful 
in the development of the point binomial we shall turn first to 
the theory of combinations. 

* Arne Fisher, The Mathematical Theory of Probabilities. The Macmillan
Company, second edition, 1923.

† J. M. Keynes, A Treatise on Probability. The Macmillan Company, 1921.

‡ Whittaker and Robinson, The Calculus of Observations. D. Van Nostrand
Company, 1924.

2. PERMUTATIONS AND COMBINATIONS


Suppose that a group of n objects is given. Any set of r * of
these objects, without regard to their order, is called a
combination of the n objects taken r at a time, and is denoted by the
symbol $_nC_r$. For example, the combinations of the first four
letters of the alphabet taken three at a time are

abc    abd    acd    bcd

Since there are four of these, we may write $_4C_3 = 4$.
If the order of the objects be taken into account, the
arrangements are known as permutations and are denoted by $_nP_r$. Thus
the letters a, b, and c may be arranged in a row in the order
abc, acb, bac, bca, cab, and cba, so that $_3P_3 = 6$. In the case of
four letters, each of the four combinations of three furnishes six
permutations, so that the total number of permutations of four
things taken three at a time is twenty-four, or $_4P_3 = 24$.

The general formulas for permutations and combinations may
be shown to have the forms

$$_nP_r = n(n-1)(n-2)\cdots(n-r+1)$$ (71)

and

$$_nC_r = \frac{_nP_r}{r!} = \frac{n(n-1)(n-2)\cdots(n-r+1)}{r!}.$$ (72)

The quantity r! is known as "factorial r" and means the
product of all integers from 1 to r.
It is also shown in algebra that $_nC_r = {}_nC_{n-r}$, so that
$_nC_n = {}_nC_0 = 1$. This theorem will be needed in a later section.
Applying the above formulas to four letters taken two at a
time, we find

$$_4P_2 = 4 \times 3 = 12, \quad \text{and} \quad _4C_2 = \frac{4 \times 3}{1 \times 2} = 6.$$

These results may be easily verified by making all possible
arrangements of four letters two at a time,

ab    ac    ad    bc    bd    cd
ba    ca    da    cb    db    dc

The student may also check the following numerical results
by applying the above formulas:

$_5P_3 = 60$     $_5C_3 = 10$     $_6P_4 = 360$     $_6C_4 = 15$
$_6P_2 = 30$     $_6C_2 = 15$     $_{10}P_3 = 720$    $_{10}C_3 = 120$

* This "r" should not be confused with the correlation coefficient. It seemed best
to retain the symbol in the theory of combinations because of its wide use by
mathematicians.
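
The numerical results just listed may also be checked by machine. The sketch below uses the functions `math.perm` and `math.comb` of the Python standard library (version 3.8 or later), which compute $_nP_r$ and $_nC_r$ respectively.

```python
import math

# (n, r) pairs taken from the checks listed in the text.
for n, r in [(5, 3), (6, 4), (6, 2), (10, 3)]:
    print(f"{n}P{r} = {math.perm(n, r)}, {n}C{r} = {math.comb(n, r)}")

# The symmetry nCr = nC(n-r) mentioned in the text.
print(math.comb(6, 4) == math.comb(6, 2))
```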


3. ELEMENTARY PROBABILITY


If an event may happen in h ways and fail in k ways, and if
each of the h + k ways is equally likely to occur, the
mathematical probability * of the event happening is

$$p = \frac{h}{h + k}$$ (73)

and the probability of its failing is

$$q = \frac{k}{h + k}.$$ (74)

It is evident that the probability of an event happening plus 
the probability of its failing is equal to 1, which is the mathe- 
matical symbol for certainty. The above results may also be 
expressed by saying that the odds are h to k in favor of the 
event happening, or k to h against its occurrence. 

Some of the simplest examples of such probability are
furnished by the results of penny and dice tossing. Let us assume
that the penny is a homogeneous disk and exclude the possibility
of its standing upon an edge or sticking in a crack. If the
turning up of the head is regarded as a successful event and the
turning up of the tail as a failure, it is evident that $p = q = \tfrac{1}{2}$. In
the case of the die, the turning up of the ace might be considered
a success. Since there are six equally likely ways for the die to
fall, the probability of the successful occurrence of this event is
$\tfrac{1}{6}$, and that of its failure is $\tfrac{5}{6}$. It should also be noted that the
odds are even of turning up a head with a coin, but are five to
one against the turning up of an ace on a die.

* We are not concerned here with the various types of probability discussed in
such treatises as Keynes, op. cit.

If several pennies or dice are used, the resulting tosses are 
considered as compound events, and the occurrences of the indi- 
vidual events are regarded as entirely independent of one another. 
Thus with two pennies, a toss resulting in two heads is a com- 
pound event, and the fall of one penny is not influenced in any 
way by the fall of the other. 

It should be observed that the same results will be obtained 
whether we deal with the occurrences of a number of similar 
events, or with several trials of the same event. This may be 
illustrated in the case of penny-tossing. The various tosses 
which occur with three coins are the same combinations that 
arise when one penny is tossed three times in succession and the 
individual occurrences then combined. A compound event may 
therefore be obtained by several trials of a single event. 

The probability for the occurrence of a compound event such
as all heads on three successive trials with a penny (or from one
toss of three pennies) may be obtained by applying the
definition of probability given above. The number of equally likely
ways in which the coin may fall on the first trial is 2, and on
each of the other two trials also 2, so that the total number of
equally likely possible ways for the compound event to occur is
2 × 2 × 2 = 8. The number of favorable ways for the event to
happen is clearly 1, so that the required probability is $\tfrac{1}{8}$. By
similar reasoning it may be shown that if the probability of an
event is p, the probability of its occurrence on all of n trials is $p^n$.
In case we are dealing with a number of dissimilar independent
events whose individual probabilities are $p_1, p_2, p_3, \ldots$, the
probability of their all occurring together is $p_1 \times p_2 \times p_3 \cdots$.

This last theorem may be illustrated in the case of a penny,
a die, and a deck of playing cards. The probability of turning
up a head on the penny, an ace on the die, and the king of
spades from the deck on one trial for each is
$\tfrac{1}{2} \times \tfrac{1}{6} \times \tfrac{1}{52} = \tfrac{1}{624}$.
The probabilities of a complete set of compound events may
be illustrated by examining the combinations which occur when
a coin is tossed three times. If the coins are designated by 1, 2,
and 3, while H stands for head and T for tail, the following
arrangement of the eight different throws may be made:

(1)   T   T   T   H   T   H   H   H
(2)   T   T   H   T   H   T   H   H
(3)   T   H   T   T   H   H   T   H

Of the 8 equally likely combinations, one is TTT, or all tails,
while another is TTH, or two tails and a head. This latter
compound event may occur in 3 different ways, however, so
that the probability of its occurrence is $\tfrac{3}{8}$. A complete set of
such probabilities may then be set down as follows:

Probability of TTT = 1/8
Probability of TTH = 3/8
Probability of THH = 3/8
Probability of HHH = 1/8
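
The enumeration above may be repeated by machine. The following sketch lists the eight equally likely throws of three coins, counts the heads in each, and also forms the product of probabilities for the penny, the die, and the deck of cards; exact fractions are used so that no rounding enters.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# All equally likely throws of three coins, e.g. ('T', 'T', 'H').
throws = list(product("TH", repeat=3))

# Number of throws giving 0, 1, 2, 3 heads, turned into probabilities.
counts = Counter(throw.count("H") for throw in throws)
probs = {heads: Fraction(count, len(throws)) for heads, count in counts.items()}
for heads in sorted(probs):
    print(heads, "heads:", probs[heads])

# Independent dissimilar events: head on penny, ace on die, king of spades.
compound = Fraction(1, 2) * Fraction(1, 6) * Fraction(1, 52)
print(compound)
```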


A general expression for the above results may now be
obtained by using the theorem for the probability of compound
events. The probability that an event will occur on all of
n trials is evidently $p^n$. In the above problem this is $(\tfrac{1}{2})^3 = \tfrac{1}{8}$.
The probability that the event will occur n − 1 times and fail
once is $p^{n-1}q$. This result, however, may occur in n different
ways, as is evident from the illustrative problem. The complete
probability for n − 1 successes and one failure is therefore
$np^{n-1}q$. Next, the probability that in n trials the event will
occur n − 2 times and fail twice is $p^{n-2}q^2$. But again, this may
occur in the number of ways in which two things may be selected
from n, which is $\frac{n(n-1)}{1 \cdot 2}$, or $_nC_2$. The total probability is
therefore $_nC_2\,p^{n-2}q^2$. Thus for three trials the probability that
there will be one success and two failures is
$\frac{3 \cdot 2}{1 \cdot 2} \times (\tfrac{1}{2})(\tfrac{1}{2})^2 = \tfrac{3}{8}$
in this example.

Continuing in the same way, it is evident that the general
expression for the probability of obtaining exactly r successes
and (n − r) failures is given by $_nC_r\,p^r q^{n-r}$.


4. THE BINOMIAL THEOREM AND THE POINT BINOMIAL 


The binomial theorem may be written in the form

$$(a + b)^n = a^n + na^{n-1}b + \frac{n(n-1)}{1 \cdot 2}a^{n-2}b^2 + \frac{n(n-1)(n-2)}{1 \cdot 2 \cdot 3}a^{n-3}b^3 + \cdots + b^n.$$ (75)

The expansion on the right is the general result of multiplying
out (a + b)(a + b)(a + b) to n factors. By making use of
the notation for combinations, a more convenient form of this
expansion may be obtained:

$$(a + b)^n = {}_nC_0\,a^n + {}_nC_1\,a^{n-1}b + {}_nC_2\,a^{n-2}b^2 + {}_nC_3\,a^{n-3}b^3 + \cdots + {}_nC_n\,b^n.$$ (76)

Applying this theorem to $(q + p)^n$, we have

$$(q + p)^n = {}_nC_0\,q^n + {}_nC_1\,q^{n-1}p + {}_nC_2\,q^{n-2}p^2 + {}_nC_3\,q^{n-3}p^3 + \cdots + {}_nC_n\,p^n,$$ {Point binomial} (77)

the terms of which agree with the general expression for the
probability of r successes found in the preceding section. The
conclusion then is that if n trials be made of an event for which
the probability of occurrence is p and the probability of failure is q,
the probabilities of 0, 1, 2, ..., n successes are given by the successive
terms in the expansion of the binomial $(q + p)^n$.

As an illustration of this theorem the thirteen terms of the
binomial $(\tfrac{1}{2} + \tfrac{1}{2})^{12}$ are worked out in Table 36 on page 196.
These are the probabilities of getting 0, 1, 2, ..., 12 heads when
one coin is tossed twelve times or twelve coins are tossed once.

It is apparent from these results that the probability of getting
all heads or all tails is very small. If twelve coins were used, only
about once in 4000 throws would such an event occur.

TABLE 36. ILLUSTRATING THE PROBABILITIES OF OBTAINING 0, 1, 2, ...
HEADS IN TOSSING TWELVE COINS

SUCCESSES (HEADS)        PROBABILITIES
        0                   1/4096 = .000244
        1                  12/4096 = .002930
        2                  66/4096 = .016113
        3                 220/4096 = .053711
        4                 495/4096 = .120850
        5                 792/4096 = .193359
        6                 924/4096 = .225586
        7                 792/4096 = .193359
        8                 495/4096 = .120850
        9                 220/4096 = .053711
       10                  66/4096 = .016113
       11                  12/4096 = .002930
       12                   1/4096 = .000244
      Total                          1.000000
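
The entries of Table 36 may be regenerated from formula (77). The sketch below is one way of doing so; the function name is ours, and exact fractions are used so that the terms can be shown both as fractions of 4096 and as decimals.

```python
from fractions import Fraction
from math import comb

def point_binomial(n, p):
    """Terms of (q + p)^n: probabilities of 0, 1, ..., n successes."""
    q = 1 - p
    return [comb(n, r) * p**r * q**(n - r) for r in range(n + 1)]

terms = point_binomial(12, Fraction(1, 2))
for r, t in enumerate(terms):
    # Express every term over the common denominator 4096 = 2**12.
    numerator = t.numerator * (4096 // t.denominator)
    print(f"{r:2d}  {numerator:4d}/4096 = {float(t):.6f}")
print("Total", float(sum(terms)))
```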


The expression $(q + p)^n$ is often called the point binomial,
since its expansion is represented by a series of isolated points.
In Fig. 47 these points have been connected by straight lines,
forming a polygon very much like the normal curve in general
appearance (see Chapter XII).

Fig. 47. Plot of the binomial $(\tfrac{1}{2} + \tfrac{1}{2})^{12}$ (horizontal axis: successes, 0 to 12; vertical axis: expected occurrences in 4096 trials)

It has already been proved that the probability of a specified
number of successes is given by the appropriate term in the point
binomial. Another important result is that the probability of
an event occurring r or more times in n trials is the sum of the
terms in the expansion of $(q + p)^n$ from $_nC_r\,q^{n-r}p^r$ to $_nC_n\,p^n$
inclusive. This follows from the fact that the n + 1 compound events
are mutually exclusive, or such that the occurrence of one
combination excludes, for that throw, the other n possible
arrangements. As shown in algebra, the probability that some one or
other of such mutually exclusive events will occur is the sum of
the probabilities of the separate (here compound) events.
Thus if twelve coins are thrown the probability of obtaining
nine or more heads on a single toss is the sum of the probabilities
$\tfrac{220}{4096} + \tfrac{66}{4096} + \tfrac{12}{4096} + \tfrac{1}{4096} = \tfrac{299}{4096}$, or .073. This result may
also be worked out by noting that of the 4096 equally likely
arrangements of the 12 coins there are 220 ways in which nine
heads may turn up, 66 ways in which ten heads may occur, 12
ways for eleven heads to appear, and 1 way in which twelve
heads may be obtained. This gives a total of 299 ways in which
at least nine heads may appear, and the probability for such
an occurrence is $\tfrac{299}{4096}$ from the definition of simple probability.
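
The sum of the upper terms may be formed directly. The short sketch below adds the terms of the point binomial from r successes upward; the function name is ours.

```python
from fractions import Fraction
from math import comb

def prob_at_least(r, n, p):
    """Probability of r or more successes in n trials: the sum of the
    terms of (q + p)^n from r successes up to n successes."""
    q = 1 - p
    return sum(comb(n, k) * p**k * q**(n - k) for k in range(r, n + 1))

# Nine or more heads in a toss of twelve coins.
tail = prob_at_least(9, 12, Fraction(1, 2))
print(tail, round(float(tail), 3))
```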


5. THE MEAN OF THE POINT BINOMIAL AND ITS 
STANDARD DEVIATION 


We shall next prove two interesting theorems in connection
with the point binomial. These are known as the theorems of
Bernoulli and are of great importance in statistical theory.
The mean of the point binomial is np, and its standard deviation
is $\sqrt{npq}$.
In proving the theorems, M and $\sigma$ are calculated as follows:


TABLE 37. CALCULATION OF M AND σ FOR THE POINT BINOMIAL


SUCCESSES   FREQUENCY                              d     fd                                     fd²
0           $q^n$                                  0     -                                      -
1           $nq^{n-1}p$                            1     $nq^{n-1}p$                            $nq^{n-1}p$
2           $\frac{n(n-1)}{2!}q^{n-2}p^2$          2     $n(n-1)q^{n-2}p^2$                     $2n(n-1)q^{n-2}p^2$
3           $\frac{n(n-1)(n-2)}{3!}q^{n-3}p^3$     3     $\frac{n(n-1)(n-2)}{2!}q^{n-3}p^3$     $\frac{3n(n-1)(n-2)}{2!}q^{n-3}p^3$
. . .       . . .                                  . . . . . .                                  . . .

Totals      1                                            $np$                                   $np[1+p(n-1)]$
The sum of the frequencies is $(q+p)^n$, or unity, and $\Sigma fd$ may be
readily factored into

$$\Sigma fd = np\left[q^{n-1} + (n-1)q^{n-2}p + \frac{(n-1)(n-2)}{2!}q^{n-3}p^2 + \cdots\right] = np(q+p)^{n-1} = np.$$

We may now apply the formula for the mean, $M = A + \frac{\Sigma fd}{N}h$.
Since $A = 0$, $N = 1$, $\Sigma fd = np$, and $h = 1$, the mean of the
binomial becomes

$$M = np.$$    {Mean of the point binomial}    (78)


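In order to obtain the standard deviation it is necessary to
find $\Sigma fd^2$ for the above series. The last column of items in
Table 37 may be factored as follows:

$$\Sigma fd^2 = np\left[q^{n-1} + 2(n-1)q^{n-2}p + \frac{3(n-1)(n-2)}{2!}q^{n-3}p^2 + \cdots\right].$$

The terms in the brackets may now be broken up to form two
series in $(q+p)$. Thus,

$$\Sigma fd^2 = np\left[\left\{q^{n-1} + (n-1)q^{n-2}p + \frac{(n-1)(n-2)}{2!}q^{n-3}p^2 + \cdots\right\} + \left\{(n-1)q^{n-2}p + (n-1)(n-2)q^{n-3}p^2 + \cdots\right\}\right]$$
$$= np\left[(q+p)^{n-1} + (n-1)p\left\{q^{n-2} + (n-2)q^{n-3}p + \cdots\right\}\right]$$
$$= np\left[(q+p)^{n-1} + (n-1)p\,(q+p)^{n-2}\right]$$
$$= np\left[1 + (n-1)p\right].$$

Applying formula (17) for standard deviation, we find that

$$\sigma = \sqrt{np[1+(n-1)p] - n^2p^2} = \sqrt{np - np^2},$$

or $\sigma = \sqrt{npq}$.    {Standard deviation of the point binomial}    (79)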


The above formulas make possible the complete description of
certain distributions given by chance. The terms in the series
furnish the ordinates of the curve, while the mean and the standard
deviation from the formulas (78) and (79) are convenient measures
of the central tendency and dispersion of such a distribution.
For the binomial $(\tfrac12+\tfrac12)^{12}$ the mean and the standard
deviation by these formulas work out at 6 and $\sqrt{3}$ respectively. In
the case of such a symmetrical series, the mean is of course
obtained by inspection.

If twelve dice are thrown and the turning up of an ace is
considered a success, the probabilities of 0, 1, 2, ... 12
successes are given by the terms in the expansion of $(\tfrac56+\tfrac16)^{12}$.
This series is distinctly skew, but the mean and the standard
deviation are readily found to be 2 and $\sqrt{5/3}$, or 1.291, on
applying formulas (78) and (79). Practical evidence of the
convenience of these formulas may be obtained by working out the
same results directly from the frequencies.
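
The direct calculation suggested above is easily sketched in Python. The block below is an illustration only, not part of the original text; the helper names binomial_terms and mean_and_sd are assumptions. It expands the binomial for twelve dice with an ace as a success, computes the mean and standard deviation from the frequencies themselves, and sets them beside the values given by formulas (78) and (79).

```python
from fractions import Fraction
from math import comb, sqrt


def binomial_terms(n, p):
    """Terms of (q + p)^n, i.e. the probabilities of 0, 1, ..., n successes."""
    q = 1 - p
    return [comb(n, k) * q ** (n - k) * p ** k for k in range(n + 1)]


def mean_and_sd(freqs):
    """Mean and standard deviation computed directly from the frequencies."""
    total = sum(freqs)
    mean = sum(k * f for k, f in enumerate(freqs)) / total
    second_moment = sum(k * k * f for k, f in enumerate(freqs)) / total
    return mean, sqrt(second_moment - mean ** 2)


n, p = 12, Fraction(1, 6)
direct_mean, direct_sd = mean_and_sd(binomial_terms(n, p))
formula_mean = n * p                    # formula (78)
formula_sd = sqrt(n * p * (1 - p))      # formula (79)
print(direct_mean, round(direct_sd, 3), formula_mean, round(formula_sd, 3))
```

The two routes should agree, the formulas merely saving the labor of the full expansion.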


6. EXPERIMENTAL VERIFICATION OF THE BINOMIAL LAW 


In order to see whether or not the actual results of penny and 
dice tossing come out as predicted by the above formulas, it will 
be interesting to cite one or two examples. While such experi- 
ments serve to verify in a rough way the properties of the point 
binomial, it should be noted that strictly speaking they are not 
verifications at all because the conditions implied in the for- 
mulas can never be met on actual trial. The perfectly homo- 
geneous penny or die does not exist, nor is it possible to make 
the tosses so that certain throws are not favored over certain 
others. Differences between the observed trials and the theo- 
retically correct results will then be due not only to the number 
of trials or size of the sample but to imperfections in the 
objects thrown, and to faulty methods in tossing them. The 
student is urged, however, to make a few personal experiments 
such as those quoted below in order that he may become more 
familiar with the meaning and practical utility of the bino- 
mial law. 

In the following experiment twelve dice were thrown 4096
times, the method being to roll them down an inclined gutter
of corrugated paper. A throw of 4, 5, or 6 was considered a
success, so that p = q = ½. The theoretical mean will then be
np, or 6, and the standard deviation $\sqrt{npq}$, or 1.732. The
following table gives the observed and theoretical frequencies.


TABLE 38. OBSERVED AND THEORETICAL FREQUENCIES OF 0, 1, 2 ...
SUCCESSES FROM THE TOSSING OF TWELVE DICE WITH THROWS OF FOUR,
FIVE, OR SIX AS SUCCESSES


SUCCESSES     OBSERVED      THEORETICAL
              FREQUENCY     FREQUENCY
 0                -               1
 1                7              12
 2               60              66
 3              198             220
 4              430             495
 5              731             792
 6              948             924
 7              847             792
 8              536             495
 9              257             220
10               71              66
11               11              12
12                -               1

Totals         4096            4096


The mean of the observed distribution is 6.139 and its stand- 
ard deviation is 1.712. The actual proportion of successes is 
0.512 instead of 0.5. The agreement, on the whole, is there- 
fore rather good. 

In the next experiment a throw of a 6 was considered a success,
so that p = 1/6 and q = 5/6. The theoretical mean is 2 and the
standard deviation is 1.291. The observed frequency distribution
was as follows:

TABLE 39. OBSERVED FREQUENCIES OF 0, 1, 2 ... SUCCESSES RESULTING
FROM THE THROWS OF TWELVE DICE WITH THE TURNING OF A SIX AS
A SUCCESS

SUCCESSES     FREQUENCY          SUCCESSES     FREQUENCY
0                447             5                115
1               1145             6                 24
2               1181             7                  7
3                796             8                  1
4                380             Total           4096


The observed mean is 2.000 and standard deviation 1.296, while 
the actual proportion of successes is .1667, agreeing with the 
theoretical values to an extent that is probably accidental. 
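
The observed constants quoted for Table 39 may be recomputed from the frequencies. The following Python sketch is an illustration, not the author's own computation; it assumes that the standard deviation is taken about the observed mean with the total frequency as divisor, and that no throw gave more than eight sixes.

```python
from math import sqrt

# Observed frequencies of 0, 1, ..., 8 sixes in 4096 throws of twelve dice.
observed = [447, 1145, 1181, 796, 380, 115, 24, 7, 1]

throws = sum(observed)
mean = sum(k * f for k, f in enumerate(observed)) / throws
second_moment = sum(k * k * f for k, f in enumerate(observed)) / throws
sd = sqrt(second_moment - mean ** 2)
proportion = mean / 12   # observed proportion of sixes per die

print(throws, round(mean, 3), round(sd, 3), proportion)
```

The mean and standard deviation should round to the values stated in the text, and the proportion of successes should lie very close to the theoretical 1/6.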
The above results show that with careful extensive experiments
such as these, the observed series is in good agreement
with the binomial expansion.


7. THE BINOMIAL APPLIED TO STATISTICAL DATA 


In the case of frequency distributions of observed data affected
by many factors, the point binomial might often be used
were it not for the large number of terms involved, and the
difficulty of replacing the mathematical probability, known
a priori, by an empirical probability ratio furnished by the data.

As an illustration we may take the records of 400 candidates 
for the master’s degree in a certain university. Among other 
requirements it was necessary for the candidate to have an 
average of B— or better. For the present purposes, such an 
average may be considered a success, and a lower average may 
be regarded as a failure. Out of 400 candidates 331 maintained
a satisfactory average, so that the empirical probability of such
a success is 331/400 = .8275. It should be noted that such a ratio
might change considerably from time to time, and would also 
tend to be unstable when applied to small numbers. We cannot 
expect, therefore, to get as good results from such empirical 
ratios as from the probabilities in the case of penny-tossing. 

The average number of candidates coming up at one time was
about ten. Taking this number as the size of the sample
(corresponding to the number of coins tossed) the point binomial
$(.1725 + .8275)^{10}$ might be used to determine the probability
for any number of successes, say nine or more.

The terms in this binomial (computed by logarithms) together
with the results actually found by trial are given in
Table 40. The probability of getting 9 or more
successes in a sample of 10 is the sum of the probabilities .314
and .150, or .464. The expected number from 400 candidates
will, therefore, be 400 × .464, or 186. This result happens to be
in close agreement with the observed number, (6 + 13) × 10 = 190.
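
The terms of this binomial need no longer be computed by logarithms. The Python sketch below is an illustration under the assumptions stated in the text (samples of ten, empirical probability 331/400); the variable names are hypothetical. Since it carries more decimal places than the three-place table, a term may differ from the tabled value by a unit in the last place.

```python
from math import comb

p = 331 / 400          # empirical probability of a satisfactory average
q = 1 - p
n = 10                 # candidates per sample

# Probability of exactly k successful candidates out of ten.
terms = {k: comb(n, k) * q ** (n - k) * p ** k for k in range(n + 1)}

nine_or_more = terms[9] + terms[10]
expected_candidates = 400 * nine_or_more

for k in range(n, 2, -1):
    print(k, round(terms[k], 3), round(40 * terms[k], 1))
print(round(nine_or_more, 3), round(expected_candidates))
```

Multiplying each term by 40, the number of samples, gives the theoretical frequencies with which the observed frequencies are compared.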
TABLE 40. OBSERVED AND THEORETICAL FREQUENCIES FOR THE NUMBER
OF SUCCESSFUL CANDIDATES FOR THE MASTER'S DEGREE, IN SAMPLES
OF TEN, THE TOTAL NUMBER OF CANDIDATES BEING 400


SUCCESSFUL CANDIDATES     OBSERVED      THEORETICAL     PROBABILITIES
OUT OF TEN                FREQUENCY     FREQUENCY

10                             6            6.0             .150
 9                            13           12.6             .314
 8                            12           11.8             .294
 7                             7            6.5             .164
 6                             0            2.4             .060
 5                             1            0.6             .015
 4                             1            0.1             .003
 3                             0            0.0             .000
Total                         40           40              1.000


[Fig. 48. Comparison of theoretical and observed frequencies for
candidate data. Horizontal axis: successful candidates out of ten;
vertical axis: frequency.]


The complete set of theoretical frequencies is found by
multiplying the probability values by 40. These frequencies agree
fairly well with those given by the data as shown in the above
table and in Fig. 48.

Further evidence of the agreement of the two series may be found
by comparing the theoretical and observed standard deviations.
The former is $\sqrt{npq}$, or 1.19, while the latter is 1.28. The
difference, or 0.09, may be readily accounted for by chance
fluctuations in sampling (see formula (91) and the testing of
differences in Chapter XIII).
EXERCISES


1. Expand the following binomials, and plot the results. 
(ee Gs) a tg) e298), C1 -+..9)1°. 
2. If the terms in the expansion of $(\tfrac12+\tfrac12)^{10}$ represent the
probabilities of 0, 1, 2, 3 ... 10 successes, find the probability of
obtaining seven or more successes in ten trials. (176/1024 = .172. Ans.)


3. Find the means and standard deviations for the binomials of 
Exercise 1, using formulas (78) and (79). Verify some of the answers 
by direct calculation from the full expansions of the binomials. 


4. From Table 41 of Chapter XII determine the empirical probability
of a man selected at random being over 71 15/16 inches in
height. Use the total distribution. (.039. Ans.) What is the
probability that a man's height will be between 66 15/16 and
67 15/16 inches? (.155. Ans.) What are the odds in favor of the
latter occurrence?

(Approx. 4:1. Ans.) 


5. Suppose that a penny is tossed, a die thrown, and a card
drawn from an ordinary deck. What is the probability of the
combined event: head on the coin, ace or six on the die, and a heart
on the card, with a single trial for each? (1/24. Ans.)


6. What is the probability of turning up a total of eight with
two dice? (5/36. Ans.)

7. If three cards are drawn from a suit of thirteen cards, what is
the chance that both king and queen are drawn? (1/26. Ans.)


8. Show that if np be a whole number, the mean of the binomial 
coincides with the greatest term. 


9. Derive formulas (78) and (79) by differentiating the expression
$(q + px)^n$ with respect to x and setting x = 1.


CHAPTER XII 
THE NORMAL PROBABILITY CURVE 
1. INTRODUCTORY 


In the present chapter we shall discuss the properties and 
uses of the normal probability curve, the general form of which 
is doubtless already familiar to the student (see Fig. 51). 

An example of a distribution resembling the normal probability
curve is furnished by the mental age data in Fig. 49. When
these data are separated into "normals" and "defectives" two
fairly symmetrical curves result. Burt* explains the lack of
complete symmetry in the curve for normals on the ground
that the Binet Scale lacks adequate tests for the brighter children
of the older ages. He concludes that even though his data
are somewhat irregular, they do not "in any way contradict the
hypothesis of 'normality,' the theory that ability is distributed
in close conformity with the normal curve of error."

In the case of certain physical characteristics such as height,
the normal curve appears to give an excellent fit to the observations.
The data in Table 41, quoted from Yule,† furnish a very
good example.

The histogram for the frequencies in the total column of the 
table is shown in Fig. 50, where the symmetry and general 
resemblance to the normal curve are apparent. 

The above examples suggest that the frequency distributions 
of some mental and physical traits conform fairly well to the 
normal curve. It would be far from correct, however, to assume 
that all human characteristics are normally distributed. This 
assumption was made by an early statistician named Quetelet. 

* Cyril Burt, Mental and Scholastic Tests, p. 162. King and Son, Ltd., London,
1921.
† Yule, Introduction to Statistics, p. 88.

[Fig. 49. Distribution according to general intelligence of children of
ordinary elementary and special M.D. schools. From "Mental and Scholastic
Tests," by Cyril Burt. Courtesy of P. S. King and Son, Ltd.]

[Fig. 50. Histogram for heights of 8585 men]

TABLE 41. DISTRIBUTION OF STATURE FOR ADULT MALES BORN
IN THE BRITISH ISLES


HEIGHT WITHOUT SHOES       NUMBER OF MEN ACCORDING TO BIRTHPLACE       TOTAL
(INCHES)                   England    Scotland    Wales    Ireland
76 15/16 to 77 15/16            1          1         -         -          2
75 15/16 to 76 15/16            1          4         -         -          5
74 15/16 to 75 15/16            9          6         1         -         16
73 15/16 to 74 15/16           16         15         1         -         32
72 15/16 to 73 15/16           48         26         2         3         79
71 15/16 to 72 15/16          117         69         6        10        202
70 15/16 to 71 15/16          254        102        21        15        392
69 15/16 to 70 15/16          473        115        33        25        646
68 15/16 to 69 15/16          753        218        52        40       1063
67 15/16 to 68 15/16          886        210        72        62       1230
66 15/16 to 67 15/16          918        210       128        73       1329
65 15/16 to 66 15/16          881        139       145        58       1223
64 15/16 to 65 15/16          740        109       108        33        990
63 15/16 to 64 15/16          524         47        83        15        669
62 15/16 to 63 15/16          320         19        48         7        394
61 15/16 to 62 15/16          128          9        30         2        169
60 15/16 to 61 15/16           70          2         9         2         83
59 15/16 to 60 15/16           39          2         -         -         41
58 15/16 to 59 15/16           12          -         1         1         14
57 15/16 to 58 15/16            3          1         -         -          4
56 15/16 to 57 15/16            1          -         1         -          2
Total                        6194       1304       741       346       8585


He pictured an average man with physical and social traits at 
the means of a series of probability curves. The work of such 
men as Pearson and Charlier, however, has since shown that 
these characteristics are best represented by a variety of curves 
among which the probability curve is a special type. (See sec- 
tions 8 and 9 of Chapter XVI.) 

It will be shown in section 5 that the resemblance of a frequency
distribution to the normal curve cannot be satisfactorily
determined by mere inspection of the data. A rigorous test of
the normality of a given distribution involves the superposition
of a normal curve on the data and a mathematical comparison of
the observed and theoretical frequency.

The normal probability curve is very important in the field of
educational measurements because of its usefulness in scale
construction and in many calculations involving qualitative series.
It is usually necessary in such problems to assume some form 
of distribution and the normal curve is taken because, of all the 
curves which might be employed, it gives the best single ap- 
proximation to the ordinary test score distribution. The mathe- 
matical properties of the probability curve, including tabulations 
of its integral and ordinate, make the calculations involved very 
much simpler than with some skew form of curve. 

Although no formal derivation of the normal curve will be 
given, its relation to the point binomial will be shown as well as 
its usefulness in the elementary theory of probability. 


2. THE EQUATION OF THE NORMAL PROBABILITY CURVE 


As already pointed out, the practical use of the point binomial
requires a great deal of labor. If, for example, the samples
in the problem of section 7, Chapter XI, had consisted of twenty
instead of ten candidates, the terms in the binomial $(q+p)^{20}$
would have to be computed.

An important simplification of the binomial law may now be
reached by allowing the size of n to increase indefinitely. It is
obvious from the binomials discussed thus far that as n becomes
larger the resulting polygon over the n + 1 points becomes
smoother and tends to spread out more and more in both directions
from the mean. The limit to the point binomial, $(q+p)^n$,
as n increases indefinitely, may be shown by mathematical proof *
to be given by the continuous curve

$$y = \frac{1}{\sigma\sqrt{2\pi}}\,e^{-\frac{x^2}{2\sigma^2}}$$    {Normal curve,† area = 1}    (80)

where e = 2.7183 ..., which is the base of the Napierian system
of logarithms, and π is the familiar ratio of the circumference
of a circle to its diameter.

* Yule, Introduction to Statistics, p. 301 (simple proof).
† The normal probability curve was first given by De Moivre in 1733 but was
later rediscovered by Laplace and Gauss.

Just as the sum of the ordinates in the point binomial $(q+p)^n$
is equal to unity, so the area under this curve is equal to 1.

From equation (80) it is evident that for x = 0, $y = \frac{1}{\sigma\sqrt{2\pi}}$,
and that about the value 0, which is the mean, the curve is
symmetrical, because the same positive and negative values of x give
a single value for y. By writing the equation in the form

$$y = \frac{1}{\sigma\sqrt{2\pi}\;e^{\frac{x^2}{2\sigma^2}}}$$

it is also apparent that no matter how large or small x
is taken, y will never become equal to zero. The curve is
thus symmetrical about the mean at x = 0, and extends
indefinitely in both directions, approaching the x-axis as an
asymptote as shown in Fig. 51.

[Fig. 51. Normal curve with unit area (if σ = 1)]

In case the normal curve is applied to data for which the total
frequency is N and not unity, the form of the equation becomes

$$y = \frac{N}{\sigma\sqrt{2\pi}}\,e^{-\frac{x^2}{2\sigma^2}}$$    {Normal curve with area = N}    (82)

each of the ordinates for unit area being multiplied by N. The
coefficient $\frac{N}{\sigma\sqrt{2\pi}}$ is often designated as $y_0$, and is the maximum
ordinate at x = 0, since $e^0 = 1$ as noted in Chapter IV, section 4.
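
Equations (80) and (82) translate directly into a short function. The following Python sketch is an illustration only; the names normal_ordinate and area_under_curve are assumptions, as are the values σ = 2 and N = 100, which are chosen merely for the example. The area is estimated by the trapezoidal rule over a range of ten standard deviations on either side of the mean, which is wide enough for the neglected tails to be negligible.

```python
from math import exp, pi, sqrt


def normal_ordinate(x, sigma=1.0, total=1.0):
    """Ordinate of the normal curve with area `total`, as in (80) and (82)."""
    return total / (sigma * sqrt(2 * pi)) * exp(-x * x / (2 * sigma * sigma))


def area_under_curve(sigma, total, half_width=10.0, steps=20000):
    """Trapezoidal estimate of the area between -half_width*sigma and +half_width*sigma."""
    lo = -half_width * sigma
    step = 2 * half_width * sigma / steps
    ys = [normal_ordinate(lo + i * step, sigma, total) for i in range(steps + 1)]
    return step * (sum(ys) - 0.5 * (ys[0] + ys[-1]))


# Hypothetical values for illustration: sigma = 2, N = 100.
y0 = normal_ordinate(0.0, sigma=2.0, total=100.0)    # maximum ordinate
area = area_under_curve(sigma=2.0, total=100.0)      # should be close to N
print(round(y0, 3), round(area, 6))
```

The symmetry of the curve may be seen by comparing the ordinates at equal positive and negative values of x.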
3. THE AREA, ORDINATES, AND DEVIATES OF THE
NORMAL CURVE


If the standard deviation of the normal curve be chosen as 1,
for convenience, the equation then takes the form

$$z = \frac{1}{\sqrt{2\pi}}\,e^{-\frac{x^2}{2}}$$    {Ordinate of the normal curve, with unit area and standard deviation}    (83)

The values of x, or number of standard deviations from the
mean, are called deviates; z is the usual symbol for the ordinate
at a given deviate; and ½a will be used to denote the area
from the mean to such a deviate. These three functions of the
curve have been computed and tabled in various ways, and are
of the greatest importance for a variety of statistical
calculations. An illustration of these functions is given in
Fig. 52, the numbers being taken from Table 42. It
will be noted that for a deviate x = 1.5, the ordinate z will have
the value .130, while the area from the mean, or ½a, will be
43.3 per cent of the total area of the curve.

[Fig. 52. Illustrating area, ordinates, and deviates for a normal curve]

The methods for calculating the areas and deviates are a
part of the calculus, but the ordinates may be determined by
merely substituting various values for x in equation (83). For
example, when x = 0, $z = \frac{1}{2.5066} = .3989$. Similarly, when x = 1,
$z = \frac{1}{2.5066}(2.7183)^{-\frac12} = .2420$.
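
The substitution described above, and the areas that the text leaves to the calculus, may both be sketched in a few lines of Python. This is an illustration and not the method by which the printed tables were prepared; the function names ordinate and half_area are assumptions, and the area from the mean is obtained from the error function in the standard library, using the identity ½a = ½ erf(x/√2).

```python
from math import erf, exp, pi, sqrt


def ordinate(x):
    """z of equation (83): ordinate of the unit normal curve at deviate x."""
    return exp(-x * x / 2) / sqrt(2 * pi)


def half_area(x):
    """Area under the unit normal curve from the mean to the deviate x."""
    return 0.5 * erf(x / sqrt(2))


for x in (0.0, 1.0, 1.5):
    print(x, round(ordinate(x), 4), round(half_area(x), 3))
```

For x = 1.5 the two functions should reproduce the ordinate and area quoted from Table 42.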
For certain problems, which will be taken up later, it has
been found convenient to calculate and table these functions in
two ways:

(1) Areas and ordinates for given deviates, and

(2) Deviates and ordinates for given areas.

Complete tables for these values are found in Pearson's*
"Tables for Statisticians and Biometricians," in Kelley's†
"Statistical Method," and in more abbreviated form in a handbook
prepared by the writer.‡ Two short lists of three-place
values are also given in Tables 42 and 43.

It is apparent that a is the area from -x to +x, as shown
in the accompanying figure. When x = ±1, a = .682, from which
it follows that more than two thirds of the total area under
the curve is included between these limits.
When x = ±3, a = .998, showing that a range of 6σ includes
more than 99 per cent of the frequency. It will also be noted
that the ordinate at x = 3 is very small, being only $\tfrac{4}{399}$ of $y_0$,
or .01 of the maximum ordinate at the mean.

[Fig. 53. Illustrating a for a normal curve; ordinates marked at
x = -1, x = 0, and x = +1]

Table 43 for deviates and ordinates in terms of area from the
mean shows that for equal increments of ½a there is very little
change in x and z in the neighborhood of the mean, but very
rapid change toward the extremities of the curve. For ½a = .50
the ordinate is equal to zero, and the deviate is infinite.


* Tables for Statisticians and Biometricians, edited by Karl Pearson. Cambridge
University Press, England. Second edition, 1924.

† T. L. Kelley, Statistical Method. The Macmillan Company, 1923.

‡ Karl J. Holzinger, Statistical Tables for Students in Education and
Psychology. The University of Chicago Press, 1925.
TABLE 42. AREAS AND ORDINATES FOR GIVEN DEVIATES FROM THE MEAN


x       ½a      z          x       ½a      z
0.0    .000    .399        2.1    .482    .044
0.1    .040    .397        2.2    .486    .035
0.2    .079    .391        2.3    .489    .028
0.3    .118    .381        2.4    .492    .022
0.4    .155    .368        2.5    .494    .018
0.5    .191    .352        2.6    .495    .014
0.6    .226    .333        2.7    .497    .010
0.7    .258    .312        2.8    .497    .008
0.8    .288    .290        2.9    .498    .006
0.9    .316    .266        3.0    .499    .004
1.0    .341    .242        3.1    .499    .003
1.1    .364    .218        3.2    .499    .002
1.2    .385    .194        3.3    .500    .002
1.3    .403    .171        3.4    .500    .001
1.4    .419    .150        3.5    .500    .001
1.5    .433    .130        3.6    .500    .001
1.6    .445    .111        3.7    .500    .000
1.7    .455    .094        3.8    .500    .000
1.8    .464    .079        3.9    .500    .000
1.9    .471    .066        4.0    .500    .000
2.0    .477    .054        4.1    .500    .000


When ½a = .25 it will be noted that x = .674, or, more
exactly, x = .6744898. This value, which is known as the probable
error, is therefore given by the relation

$$\text{P.E.} = .6744898\,\sigma$$    {Probable error}    (84)

It is very frequently used as a unit of measurement on the
normal scale instead of σ, chiefly because of long usage.
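
The deviate corresponding to a given area, which Table 43 supplies to three places, may also be obtained from the inverse of the cumulative normal distribution. The sketch below is an illustration in Python using the standard library class NormalDist (available from Python 3.8); the function name deviate_for_half_area is an assumption.

```python
from statistics import NormalDist

unit_normal = NormalDist(mu=0.0, sigma=1.0)


def deviate_for_half_area(half_area):
    """Deviate x such that the area from the mean to x equals half_area."""
    return unit_normal.inv_cdf(0.5 + half_area)


# With one quarter of the area between the mean and the deviate,
# the deviate is the probable error in units of sigma.
probable_error = deviate_for_half_area(0.25)
print(round(probable_error, 7))
```

The same function applied to other areas, say .45, gives the corresponding entries in the deviate column of Table 43.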

It may also be observed that P.E. and Q are the same for a
normal curve, since exactly half of the area is included when they
are laid off on either side of the mean. With actual data, P.E.
will not be equal to Q, so that it is usually better to avoid the
use of the term probable error in describing an observed frequency
distribution. The term arose in connection with distributions
of error such as those in astronomical measurements.
With ordinary data such a deviate does not represent an error
and the term probable error is therefore a misnomer.
TABLE 43. DEVIATES AND ORDINATES FOR GIVEN AREA FROM THE MEAN


½a      x        z          ½a      x        z
.00    0.000    .399        .26    0.706    .311
.01    0.025    .399        .27    0.739    .304
.02    0.050    .398        .28    0.772    .296
.03    0.075    .398        .29    0.806    .288
.04    0.100    .397        .30    0.842    .280
.05    0.126    .396        .31    0.878    .271
.06    0.151    .394        .32    0.915    .262
.07    0.176    .393        .33    0.954    .253
.08    0.202    .391        .34    0.994    .243
.09    0.228    .389        .35    1.036    .233
.10    0.253    .386        .36    1.080    .223
.11    0.279    .384        .37    1.126    .212
.12    0.305    .381        .38    1.175    .200
.13    0.332    .378        .39    1.227    .188
.14    0.358    .374        .40    1.282    .175
.15    0.385    .370        .41    1.341    .162
.16    0.412    .366        .42    1.405    .149
.17    0.440    .362        .43    1.476    .134
.18    0.468    .358        .44    1.555    .119
.19    0.496    .353        .45    1.645    .103
.20    0.524    .348        .46    1.751    .086
.21    0.553    .342        .47    1.881    .068
.22    0.583    .337        .48    2.054    .048
.23    0.613    .331        .49    2.326    .027
.24    0.643    .324        .50    ∞        .000
.25    0.674    .318


4. COMPARISON OF THE POINT BINOMIAL AND THE
NORMAL CURVE


The close agreement between the binomial series and the
normal curve may be illustrated for the binomial $(\tfrac12+\tfrac12)^{16}$,
the ordinates for which are given by expansion as shown in
Chapter XI.

In order to compute the normal ordinates at the 17 binomial
points it is first necessary to calculate the values of the latter as
deviates from the mean. Since the standard deviation of the
binomial is $\sqrt{npq}$, or 2, for the above series, the deviates at
0, 1, 2, 3, ... successes will be

$$\frac{0-8}{2} = -4, \quad \frac{1-8}{2} = -3.5, \quad \frac{2-8}{2} = -3, \text{ etc.}$$
The ordinates of the normal curve for these deviates may
now be looked up in Table 42, and divided by 2 in order to 
make them comparable with the binomial ordinates. The tabled 
values, of course, are for unit standard deviation. A complete 
list of the abscissas and ordinates for both curves may then be 
obtained as shown in Table 44. 


TABLE 44. ORDINATES FOR THE BINOMIAL $(\tfrac12+\tfrac12)^{16}$, WITH CORRESPONDING
NORMAL ORDINATES


SUCCESSES     BINOMIAL      x/σ       NORMAL        NORMAL
              ORDINATES               ORDINATES     ORDINATES
                                      FOR σ = 1     FOR σ = 2

 0             .000         -4.0       .000          .000
 1             .000         -3.5       .001          .0005
 2             .002         -3.0       .004          .002
 3             .0085        -2.5       .018          .009
 4             .028         -2.0       .054          .027
 5             .067         -1.5       .130          .065
 6             .122         -1.0       .242          .121
 7             .1745        -0.5       .352          .176
 8             .196          0.0       .399          .1995
 9             .1745        +0.5       .352          .176
10             .122         +1.0       .242          .121
11             .067         +1.5       .130          .065
12             .028         +2.0       .054          .027
13             .0085        +2.5       .018          .009
14             .002         +3.0       .004          .002
15             .000         +3.5       .001          .0005
16             .000         +4.0       .000          .000
Total         1.000                                  1.000


From these values and by inspection of Fig. 54 it is apparent
that the agreement between the two curves is very close. For
more terms, of course, the discrepancies between the ordinates
would have been even less than those found here.

The equation of the normal curve here considered is clearly

$$y = \frac{1}{\sqrt{8\pi}}\,e^{-\frac{x^2}{8}},$$

since it is only necessary to substitute σ = 2 in equation (80).
The mean of the curve is set at 8 successes.
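
The comparison of the two sets of ordinates may be repeated without recourse to the tables. The Python sketch below is an illustration only, with hypothetical variable names; it computes each binomial ordinate exactly and the normal ordinate from the equation above with the mean set at 8 successes, so the figures are carried to more places than the tabled three.

```python
from math import comb, exp, pi, sqrt

n, p = 16, 0.5
mean = n * p                      # 8 successes
sigma = sqrt(n * p * (1 - p))     # 2

rows = []
for k in range(n + 1):
    binomial = comb(n, k) * p ** k * (1 - p) ** (n - k)
    deviate = (k - mean) / sigma
    normal = exp(-deviate ** 2 / 2) / (sigma * sqrt(2 * pi))
    rows.append((k, binomial, deviate, normal))
    print(k, round(binomial, 4), deviate, round(normal, 4))

largest_gap = max(abs(b - y) for _, b, _, y in rows)
print(round(largest_gap, 4))
```

The largest discrepancy falls at the mean, where the normal ordinate slightly exceeds the binomial term.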
[Fig. 54. The point binomial $(\tfrac12+\tfrac12)^{16}$ compared with the normal curve
$y = \frac{1}{\sqrt{8\pi}}e^{-\frac{x^2}{8}}$. Horizontal axis: successes, 0 to 16.]


5. FITTING A NORMAL CURVE TO A FREQUENCY 
DISTRIBUTION OF DATA 


The method for fitting a normal curve to a series of observa- 
tions is similar to that just described, with the exception that 
areas and not ordinates are to be compared in determining the 
goodness of fit. The superposed curve is determined by taking 
its area, mean, and standard deviation equal to those obtained 
from the data.* 

The work may be illustrated for the distribution of I.Q.’s 
given in Table 20 of Chapter VII. The necessary constants, 
already worked out, are 


* For a more complete discussion of such fitting see Chapter XVI. 
N = 4834,
σ = 1.686 × 10 (1.661 with Sheppard's correction *),
M = 89.28.


Using formula (82), the equation of the desired normal curve
will be

$$y = \frac{4834}{\sqrt{2\pi}\,(1.661)}\,e^{-\frac{x^2}{2(1.661)^2}}.$$

It will be noted that the standard deviation is expressed in units
of class intervals, which is necessary in order to make $y_0$
comparable with the observed frequency in the interval at the mean,
and bring the total area and frequency equal to N.


TABLE 45. NORMAL ORDINATES FOR I.Q. DATA


x/σ        SCALAR ABSCISSAS         z        y = (N/σ) × z

 0.0       89.28                   .399      1161
±0.5       97.58 and 80.98         .352      1024
±1.0       105.89 and 72.67        .242       704
±1.5       114.19 and 64.37        .130       378
±2.0       122.50 and 56.06        .054       157
±2.5       130.80 and 47.76        .018        52
±3.0       139.11 and 39.45        .004        12
±3.5       147.41 and 31.15        .001         3
±4.0       155.72 and 22.84         -           -


The value for $y_0$, when x = 0, is $\frac{4834}{2.5066 \times 1.661} = 1161$. From
Table 42 it will be noted that the value for z at x = 0 is
$\frac{1}{\sqrt{2\pi}} = .399$. It is therefore necessary to multiply this and all
of the other ordinates taken from this table by the factor
$\frac{N}{\sigma} = 2910$. Thus the ordinates at ±0.5σ, ±1.0σ, etc., will have the
values 2910 × .352 = 1024, 2910 × .242 = 704, etc.
The ordinates may be plotted at any convenient distances
from the mean, say at multiples of 0.5σ, which must be worked
out in actual scale units. The value for $y_0$ will, of course, be
taken at 89.28, while the ordinates at 0.5σ will be located at
89.28 ± .5(16.61), or at 97.58 and 80.98, etc. A complete list of
values is shown in Table 45.

* See Chapter XVI, section 8.
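
The arithmetic of the fitting may be sketched as follows. This Python block is an illustration, not the author's worksheet; the variable names are hypothetical, and the constants are those given in the text (N = 4834, M = 89.28, σ = 16.61 scale units with a class interval of 10). Because the ordinate z is computed in full rather than taken to three places from Table 42, a fitted ordinate may differ from the tabled value by a unit.

```python
from math import exp, pi, sqrt

N = 4834               # total frequency
M = 89.28              # mean I.Q.
sigma_scale = 16.61    # standard deviation in scale units (Sheppard-corrected)
interval = 10          # width of the class interval
sigma = sigma_scale / interval   # standard deviation in class-interval units

y0 = N / (sigma * sqrt(2 * pi))  # maximum ordinate, at the mean
factor = N / sigma               # multiplier applied to each tabled z

rows = []
for step in range(0, 9):
    d = 0.5 * step                           # deviate in units of sigma
    y = y0 * exp(-d * d / 2)                 # fitted ordinate
    upper = M + d * sigma_scale              # abscissa above the mean
    lower = M - d * sigma_scale              # abscissa below the mean
    rows.append((d, upper, lower, y))
    print(d, round(upper, 2), round(lower, 2), round(y))
```

Plotting these ordinates at the scalar abscissas gives the fitted curve that is superposed on the histogram.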

A histogram of the observed frequencies and the fitted normal 
curve have been plotted on the same background in Fig. 55. 


[Fig. 55. Histogram for 4834 intelligence quotients with fitted normal curve.
Horizontal axis: I.Q.; vertical axis: frequency.]


The agreement, as judged by mere inspection, appears to be
rather good, but this method of comparison is worth very little
in determining whether or not a particular mathematical curve
adequately describes a body of data. The accurate method is
to compare the discrepancies in frequency (area) between the
histogram and the curve and determine whether or not the
differences may be accounted for by chance fluctuations
of sampling. This test for goodness of fit will be applied
in the chapter on Sampling (section 7).
6. SOME PROPERTIES OF THE NORMAL CURVE


From the fact that the normal curve is a continuous function
it is now possible to find the probability for an occurrence
between any two limits, $x_1$ and $x_2$. The actual frequency between
these limits gives the number of favorable ways the event may
happen, while the total frequency gives the total number of
possible ways. The quotient of these two frequencies, or

$$\frac{\text{Frequency of occurrence between } x_1 \text{ and } x_2}{\text{Total frequency of all occurrences}},$$

then furnishes the desired measure of the probability.

In case the unit-area form of the normal curve is used, the
denominator of this fraction becomes 1, and the probability for
an occurrence between $x_1$ and $x_2$ is merely the area between
these limits.

This area, which is known as the probability integral, may
be found by using the appropriate values of ½a given in
Table 42 or in more extended tables such as Pearson's.

To illustrate the use of Table 42 in this connection, let us
find the probability for an occurrence between 1σ and 2σ.
This is represented in Fig. 56 by the shaded area. From the
table the area from x = 0 to x = 2 is found to be .477, while
the area from x = 0 to x = 1 is .341. The required area and
probability is therefore the difference between these two values,
or .136.
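
The subtraction of tabled areas may be checked by computing the probability integral directly. The sketch below is an illustration in Python; the function names are assumptions, and the area from the mean is obtained from the error function of the standard library rather than from Table 42.

```python
from math import erf, sqrt


def area_from_mean(x):
    """Area under the unit normal curve from the mean to the deviate x."""
    return 0.5 * erf(x / sqrt(2))


def prob_between(x1, x2):
    """Probability of an occurrence between the deviates x1 and x2."""
    return area_from_mean(x2) - area_from_mean(x1)


print(round(area_from_mean(2), 3), round(area_from_mean(1), 3),
      round(prob_between(1, 2), 3))
```

Since the area from the mean is negative for negative deviates, the same subtraction serves for limits on opposite sides of the mean.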

The same reasoning may be applied in the case of a distribu- 
tion of observed data such as the 4834 I.Q.’s. In order to find 
the probability of getting an I.Q. between 130 and 140 in such 
a group it is only necessary to divide 36 (the number of favor- 
able occurrences) by 4834 (the number of equally likely occur- 
rences), and obtain .0074 as the required probability. Thus, if 
the 4834 I.Q.’s were recorded on little tickets and mixed up in 
a box, the chance of drawing a card with I.Q. between 130 and 
140 would be .0074, or less than one in a hundred. 
The probability integral is also useful in determining the
chances that an occurrence will lie within or without a given
middle range about the mean. Thus the probability (from
Table 42) for an event between -3σ and +3σ is the value of
a at x = 3, that is, 2 × .499, or .998, while the probability for
an occurrence beyond these limits in either direction is .002.
By more extended tables,* these two values are .9973002 and
.0026998, respectively. The probability for an occurrence
beyond ±6σ is .000000002, or only twice in a billion trials.

[Fig. 56. Illustrating the area between x = 1 and x = 2 on a normal curve]
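
The values quoted from the more extended tables may be reproduced as follows. This Python sketch is an illustration only, with hypothetical function names; the complementary error function is used for the area beyond the limits, since subtracting from unity would lose accuracy when the tail is as small as that beyond six standard deviations.

```python
from math import erf, erfc, sqrt


def prob_within(k):
    """Probability of a deviate between -k and +k standard deviations."""
    return erf(k / sqrt(2))


def prob_beyond(k):
    """Probability of a deviate outside -k to +k, in either direction."""
    return erfc(k / sqrt(2))


print(round(prob_within(3), 7), round(prob_beyond(3), 7))
print(prob_beyond(6))
```

The last figure should come out at roughly two in a billion, in agreement with the text.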

In case the probable error is used as a unit of measurement it
is possible to determine the probabilities for an occurrence
between the given multiples of P.E. when laid off on either side of
the mean. Thus the chance of a deviate within ±1 P.E. is ½
(by definition). A short table of such probabilities is given
in Table 46.
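
Probabilities for multiples of the probable error follow from the same integral once the multiple is converted to standard deviations by relation (84). The Python sketch below is an illustration; the name prob_within_pe is an assumption. As the values are computed in full, an entry may occasionally differ from a printed three-place table by a unit in the last place.

```python
from math import erf, sqrt

PE_IN_SIGMA = 0.6744898   # probable error in units of sigma, relation (84)


def prob_within_pe(multiple):
    """Probability of a deviate within +/- the given multiple of P.E."""
    return erf(multiple * PE_IN_SIGMA / sqrt(2))


for m in (0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5):
    print(m, round(prob_within_pe(m), 3))
```

The entry for a multiple of 1.0 should of course be one half.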

Another interesting property of the normal curve makes it 
possible to find the mean of the portion between any two ordi- 
nates. Let the equation of the curve be taken in the form 


$$ z = \frac{1}{\sqrt{2\pi}}\, e^{-x^{2}/2} \qquad \text{(Ordinate of the normal curve, with unit area and standard deviation)} \tag{83} $$


* Pearson, Tables for Statisticians and Biometricians, Cambridge University Press. 
TABLE 46. PROBABILITIES THAT A DEVIATE WILL LIE WITHIN CERTAIN
LIMITS ON A NORMAL CURVE


MULTIPLE OF P.E.     PROBABILITY OF AN OCCURRENCE WITHIN A
                     RANGE OF ± A GIVEN MULTIPLE OF P.E.
0.5                  .264
1.0                  .500
1.5                  .688
2.0                  .822
2.5                  .908
3.0                  .957
3.5                  .982
4.0                  .993
4.5                  .998


with unit area and standard deviation; let $z_1$ and $z_2$ be the ordinates
at any two points $x_1$ and $x_2$, the second abscissa having the
larger value; let ${}_1n_2$ be the area between these ordinates; and
let ${}_1\bar{x}_2$ denote the mean of the inclosed portion. Then it may be
proved * that

$$ {}_1\bar{x}_2 = \frac{z_1 - z_2}{{}_1n_2} \qquad \text{(Mean of a portion of a normal curve, with unit area and standard deviation)} \tag{85} $$

Fig. 57. Illustrating the mean of a portion of a normal curve between
x = 2 and x = 3 (mean of the portion = +2.3)


* For any continuous function, z = f(x), the mean between the limits $x_1$ and $x_2$ is
given by $\int_{x_1}^{x_2} x z\,dx \Big/ \int_{x_1}^{x_2} z\,dx$. In the present case, $z = \frac{1}{\sqrt{2\pi}}\, e^{-x^{2}/2}$ and $\int_{x_1}^{x_2} z\,dx = {}_1n_2$.
The integral in the numerator may be readily evaluated, giving $\big[-z\big]_{x_1}^{x_2}$, or $z_1 - z_2$.
Therefore ${}_1\bar{x}_2 = \dfrac{z_1 - z_2}{{}_1n_2}$.
This theorem may be illustrated by finding the mean of
the piece included between ordinates at x = 2 and x = 3.
From Table 42, $z_1$ = .054 and $z_2$ = .004. The value for ${}_1n_2$ may
be found by subtracting ½α for $x_1$ from ½α for $x_2$, that is to
say, .499 − .477 = .022 gives the area ${}_1n_2$ between the two
ordinates. The required mean for this piece is therefore

$$ {}_1\bar{x}_2 = \frac{z_1 - z_2}{{}_1n_2} = \frac{.054 - .004}{.022} = +2.3 \quad \text{(Fig. 57)}, $$

or 2.3 standard deviations above the mean of the whole curve.
With Pearson’s tables we find

$$ {}_1\bar{x}_2 = \frac{.0539910 - .0044318}{.0214002} = 2.31583. $$

It should be noted that ${}_1n_2$ is always positive, and that the
sign of ${}_1\bar{x}_2$ is determined by the difference between the ordinates,
which must be subtracted in the order indicated. Thus the
mean of the piece between x = −2 and x = 1 will be obtained
by adding the two values for ½α and subtracting the larger
from the smaller ordinate, that is, from Table 42,

$$ {}_1\bar{x}_2 = \frac{.054 - .242}{.477 + .341} = \frac{-.188}{.818} = -0.23 \quad \text{(Fig. 58)}. $$

Fig. 58. Illustrating the mean of a portion of a normal curve between
x = −2 and x = +1
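
The two worked cases can be verified with a few lines of code. The following is a minimal sketch, assuming `statistics.NormalDist` supplies the ordinates and areas that the text reads from tables; `mean_of_portion` is a hypothetical name.

```python
from statistics import NormalDist

def mean_of_portion(x1, x2):
    """Mean (in sigma units) of the piece of a unit normal curve between x1 < x2."""
    unit = NormalDist()
    z1, z2 = unit.pdf(x1), unit.pdf(x2)   # ordinates at the two limits
    area = unit.cdf(x2) - unit.cdf(x1)    # inclosed area, always positive
    return (z1 - z2) / area               # ordinates subtracted in the order indicated

print(round(mean_of_portion(2, 3), 4))    # piece between x = 2 and x = 3
print(round(mean_of_portion(-2, 1), 2))   # piece between x = -2 and x = +1
```

With unrounded ordinates the first result may differ from the seven-place table value in the last digit or so, since the table entries are themselves rounded.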
7. REPRESENTING DATA ON A NORMAL SCALE


In case we are dealing with a series of observations with
standard deviation σ, and total frequency N, formula (85) may
be modified so that the inclosed area is a fraction of the total, and
the mean is expressed in units of the standard deviation, that is,

$$ \frac{{}_1\bar{x}_2}{\sigma} = \frac{z_1 - z_2}{{}_1n_2 / N} \qquad \text{(Mean of a portion of a normal curve with area = N)} \tag{86} $$

By means of the above formula it is now possible to represent 
a qualitative series of observations on a normal scale, assigning 
to each class the numerical value given by the mean of each 
sub-group. In this way the qualitative series has been converted 
into a quantitative one, the assumption being that the law be- 
hind the data is the normal distribution. This method is of 
the greatest importance because it makes possible the applica- 
tion of many formulas requiring numerical values for the classes 
(see Chapter XIV). 

Any other curve might be used to represent such data, but as 
indicated at the beginning of this chapter the normal curve is 
the best single approximation to most educational data, and 
very fortunately it is extremely simple to apply. 

As an example, let us represent the following qualitative se- 
ries on a linear and then on a normal scale. The data are general 
health estimates of school children made by several physicians. 


TABLE 47. HEALTH DATA WITH PERCENTAGE FREQUENCIES


HEALTH OF CHILD        f      PERCENTAGE f
Very robust            16         2.0
Robust                199        24.4
Normal                345        42.3
Rather delicate       115        14.1
Delicate              124        15.2
Very delicate          16         2.0

Total                 815       100.0
If we assume that the attribute, health, is distributed with
equal frequency along a scale, the resulting series will form a
long rectangle. The mean of each piece, occurring at the middle,
might then be taken as a numerical measure of the class. This
method, however, would be unsound because it assumes a form
of distribution totally unlike any observed for such traits.

Fig. 59. Rectangular distribution of the health series (scale from 0 to 815;
class means at 8, 78, 197.5, 427.5, 699.5, and 807 for V.D., D., R.D., N., R., and V.R.)

Assuming that health is normally distributed, the series may
be represented as in Fig. 60. It is now possible to determine
the means of the various pieces by the use of Table 43. The
need of such a table becomes apparent when it is noted that
areas and not deviates are furnished by the data. While
it is better to use more extended tables, such as Kelley’s* or
Holzinger’s, the work will be illustrated by Tables 48 and 49,
the figures in parentheses being obtained from Holzinger’s Table XII.

Fig. 60. Representation of the health data on a normal scale

If the ordinates are designated as $z_1, z_2, z_3, \cdots, z_7$ it is
evident that $z_1$ and $z_7$ are zero. The other five ordinates,
inclosing various pieces, may be obtained by reducing the areas
to total unit area, and entering Table 43 with the proper value
of ½α. Thus the area to the right of $z_6$ is .020, so that
½α = .480; the area to the right of $z_5$ is .264, giving
½α = .236; while the area to the right of $z_4$ is .687, for which
½α = .187, as shown in Table 48.


* T. L. Kelley, Statistical Method. The Macmillan Company.
TABLE 48. SHOWING THE CALCULATION OF THE FIVE ORDINATES FOR THE
HEALTH DATA REPRESENTED ON A NORMAL SCALE


ORDINATE    AREA BETWEEN     ½α              VALUE OF
            ORDINATES                        ORDINATE
z7                           .50 (.500)      .000
            .020
z6                           .48 (.480)      .048 (.0484)
            .244
z5                           .24 (.236)      .324 (.3269)
            .423
z4                           .19 (.187)      .353 (.3543)
            .141
z3                           .33 (.328)      .253 (.2550)
            .152
z2                           .48 (.480)      .048 (.0484)
            .020
z1                           .50 (.500)      .000


The means may now be obtained by subtracting the proper 
ordinates and dividing by the area between them. The work 
may then be set down as follows: 


TABLE 49. SHOWING THE CALCULATION OF THE MEANS OF THE
HEALTH CATEGORIES


MEAN                          VALUE FROM 3-PLACE TABLE          VALUE FROM 4-PLACE TABLE

${}_6\bar{x}_7/\sigma$        (.048 − .000)/.020 = +2.40        (.0484 − .0000)/.020 = +2.42

${}_5\bar{x}_6/\sigma$        (.324 − .048)/.244 = +1.13        (.3269 − .0484)/.244 = +1.14

${}_4\bar{x}_5/\sigma$        (.353 − .324)/.423 = +0.07        (.3543 − .3269)/.423 = +0.06

${}_3\bar{x}_4/\sigma$        (.253 − .353)/.141 = −0.71        (.2550 − .3543)/.141 = −0.70

${}_2\bar{x}_3/\sigma$        (.048 − .253)/.152 = −1.35        (.0484 − .2550)/.152 = −1.36

${}_1\bar{x}_2/\sigma$        (.000 − .048)/.020 = −2.40        (.0000 − .0484)/.020 = −2.42


As a check on the computation, the products of the means by 
the corresponding areas, when added, should equal zero (the 
mean of the whole distribution), for example, 
2.40 × .020 + 1.13 × .244 + 0.07 × .423
− 0.71 × .141 − 1.35 × .152 − 2.40 × .020 = + .00002.


The check in this case is accidentally close. 
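
The whole scaling procedure for the health data may be sketched in code. This is an illustrative sketch only: it assumes `statistics.NormalDist` in place of Table 43, takes the class proportions in order from very delicate to very robust, and uses the hypothetical name `category_means`.

```python
from statistics import NormalDist

def category_means(proportions):
    """Mean of each class, in sigma units, when ordered classes fill a normal curve."""
    unit = NormalDist()

    def ordinate(cumulative):
        # Ordinate at the deviate cutting off `cumulative` area; zero at both ends.
        if cumulative <= 0.0 or cumulative >= 1.0:
            return 0.0
        return unit.pdf(unit.inv_cdf(cumulative))

    means, cumulative = [], 0.0
    for share in proportions:
        lower = ordinate(cumulative)
        cumulative = round(cumulative + share, 12)  # guard against float drift
        upper = ordinate(cumulative)
        means.append((lower - upper) / share)       # difference of ordinates over the area
    return means

# Proportions from very delicate up to very robust.
health = [0.020, 0.152, 0.141, 0.423, 0.244, 0.020]
means = category_means(health)
print([round(m, 2) for m in means])
# Check: the area-weighted means should sum to zero, the mean of the whole curve.
print(sum(m * p for m, p in zip(means, health)))
```

Because the ordinates are computed without rounding, the results correspond to the values from the more extended table rather than to the three-place figures.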

According to these results the “health” difference between a
typical robust and a typical normal child is 1.06σ, while the
difference between a normal and a rather delicate child is 0.78σ.
It will also be noted that the mean of the normal health group 
is very close to zero (the mean of the whole distribution) and 
that the very delicate and very robust groups are equally diver- 
gent from this point. 

While comparisons such as these are often of great value in 
analyzing a body of qualitative data, the chief use of this scaling 
method is in studying the relationship between several traits. 
It is possible, for example, to obtain a measure of the rela- 
tionship (correlation) between health and general nutrition, or 
between health and intelligence, by representing the pairs of 
characters on normal scales (see Chapter XIV). 


8. THE SCALING OF TEST QUESTIONS 


The normal curve has been widely used in the scaling of stand- 
ardized test questions. Assuming that the ability of the pupils 
is measured by the difficulty of the exercises, the latter may be 
represented on a normal scale. If nearly all of a group of pupils 
solve a problem, its value will be low; if 50 per cent do an 
exercise correctly, its value will be at the mean; while if very 
few succeed on an item, it will be located high on the normal 
scale. The particular scale value of the item is thus determined 
by the per cent of the group solving the problem correctly. 

In Fig. 61 the percentage of correct solutions is shown by the
shaded area, and the value of the item is given by the
corresponding abscissa. In order to obtain this value for this
example it is only necessary to enter Table 43 with ½α = .20,
giving x = .524. The problem thus has a difficulty or ability
value of .524 standard deviation above the mean.

Fig. 61. Illustrating the scaling of test questions with the normal curve

By taking the mean at zero it will be noted that negative
values of the deviates will occur. This may be overcome by
shifting the origin to some convenient point, say 5σ below the
mean, as shown in Fig. 61. Such an arbitrary origin should not be
confused with the point for “just no ability in the trait” sought
after by some test makers. Just as temperature is measured on
the Fahrenheit scale from an arbitrary zero, not representing the
point for no heat, so educational scales may be taken from any
convenient reference point, not representing “just no ability.”

It is possible to scale the items one at a time, or several at once,
as proposed by McCall.* The procedure by the first method
may be further illustrated with some reading questions given
to a large group of twelve-year-old pupils. In Table 50 the first
and sixth questions will have negative deviates given by
entering Table 43 with ½α = (.98 − .50) and (.75 − .50), while the
other two questions will have positive deviates, being at the
right of the mean. The final scaled values are obtained by
merely adding 5 to each of these deviates.


* McCall, How to Measure in Education. The Macmillan Company. 
TABLE 50. SHOWING A METHOD FOR SCALING EACH TEST ITEM


PROBLEM     PER CENT SOLVING     ½α      DEVIATE x     SCALED VALUE
            CORRECTLY
1                 98             .48      − 2.054         2.946
6                 75             .25      − 0.674         4.326
…                 46             .04      + 0.100         5.100
…                  4             .46      + 1.751         6.751
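
The item-by-item method can be sketched as follows. The example assumes `statistics.NormalDist` in place of Table 43 and an origin 5σ below the mean; the function name `scaled_value` is hypothetical.

```python
from statistics import NormalDist

def scaled_value(per_cent_correct, origin_shift=5.0):
    """Scale value of one item from the per cent of pupils solving it."""
    # The item sits at the deviate with `per_cent_correct` of the area to its right,
    # so easy items fall below the mean and hard items above it.
    deviate = NormalDist().inv_cdf(1.0 - per_cent_correct / 100.0)
    return origin_shift + deviate

for pct in (98, 75, 46, 4):
    print(pct, round(scaled_value(pct), 3))
```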


By McCall’s method it is necessary to note the percentage of
successful replies to at least 0, 1, 2, 3, ··· questions, the items
being previously arranged in rough order of difficulty. Thus,
with the above reading material, the following results were
obtained:


TABLE 51. SHOWING McCALL’S METHOD OF SCALING TEST QUESTIONS


NUMBER OF      NUMBER OF PUPILS    PERCENTAGE OF PUPILS       ½α       x         SCALED VALUE
QUESTIONS      OBTAINING           EXCEEDING, PLUS HALF
CORRECT = Q    GIVEN Q             THOSE REACHING Q

0                  1                   99.9                  .499    − 3.090       1.910
1                  3                   99.5                  .495    − 2.576       2.424
2                  5                   98.6                  .486    − 2.197       2.803
3                  7                   97.3                  .473    − 1.927       3.073
4                  9                   95.6                  .456    − 1.706       3.294
· · ·
Total            462 = N


In order to obtain the percentage of pupils above a given
class value, McCall has added one half of the number of pupils
at Q to the number exceeding Q, and then divided by the total
number in the sample. The arithmetic for the first two values
in the above table will then be

$$ \frac{461 + \tfrac{1}{2} \times 1}{462} = .999, \qquad \frac{458 + \tfrac{1}{2} \times 3}{462} = .995. $$
The deviates and scaled values are obtained by Holzinger’s
Table XII and by adding 5 to eliminate the negative signs.
McCall, however, multiplies these last values by 10 and calls
them T scores.
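
The steps of McCall's method may be sketched in code as follows. This is an illustration under stated assumptions: `statistics.NormalDist` replaces the printed table, the function name `mccall_t_scores` is hypothetical, and the frequencies used are those of Exercise 7 at the end of this chapter.

```python
from statistics import NormalDist

def mccall_t_scores(counts):
    """T score for each number of questions correct, Q = 0, 1, 2, ...

    `counts[q]` is the number (or per cent) of pupils obtaining exactly q correct.
    """
    total = sum(counts)
    unit = NormalDist()
    scores, at_or_below = [], 0
    for count in counts:
        at_or_below += count
        exceeding = total - at_or_below
        # Share exceeding Q, plus half of those just reaching Q.
        share = (exceeding + 0.5 * count) / total
        deviate = unit.inv_cdf(1.0 - share)   # negative when most pupils exceed Q
        scores.append(10 * (5 + deviate))     # shift origin by 5 sigma, then multiply by 10
    return scores

per_cent_at_q = [2, 6, 12, 18, 20, 14, 12, 10, 6]
print([round(t, 1) for t in mccall_t_scores(per_cent_at_q)])
```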

According to the first method of scaling, the score of a pupil
answering the first four questions correctly would be the sum
of the four scaled values. By McCall’s method, such a
performance would be scaled by assigning the T score corresponding
to Q = 4 from Table 51. McCall’s method is, therefore, very
convenient, but there is some doubt as to the assumption that
different sequences of problems (for example, 1, 2, 3, 4, 5, ···;
1, 2, 4, 5, 6; etc.) obtained by various pupils have the same value.

It should be noted that great precision in scaling test
material is idle. The figures above have been put down as they
came from the tables, but they should ordinarily be rounded
off to one decimal place at most.

Scaled values are often an unnecessary refinement in measur- 
ing large groups as evidenced by the high correlations between 
scaled and unscaled items. Professor Douglass,* for example, 
found a correlation of about .98 between weighted and un- 
weighted algebra scores, a result which is much higher than 
the reliability of the tests themselves. He concluded that the 
unscaled values give the relative standing of the pupil with
sufficient accuracy for ordinary testing uses. 

In the case of individual measurements, scaled values also 
lose much of their significance because they are based upon a 
large group and may not apply to a single person. Thus for 
the whole group, problem 1 in Table 50 has the value 2.9, 
while problem 6 has the value 4.3. For a given individual, 
however, it is not improbable that the two items are equally 
difficult. 


* H. R. Douglass and P. L. Spencer,‘‘Is it Necessary to weight Exercises in Stand- 
ard Tests?”’, Journal of Educational Psychology, February, 1923, p.109. Dr. Scates 
and the writer have also found correlations of .994, .995, .997, and .998 between 
weighted and unweighted scores, the number of items weighted varying from six 
to ten, and the weights being quite different. 
The chief advantage in scaling by the above methods is that
test results are thereby expressed in comparable units from
comparable reference points (for example, a T of 60 on any
test means 1σ above the mean). Weighted values may also
be used to graduate test material in order of difficulty or to 
arrange parallel groups of items such as spelling words of 
equal difficulty. 

When test material is to be scaled by the judgment of experts
rather than by the performance of the pupils, the normal curve
may again be employed. The procedure here is to have the
judges arrange the pupil specimens (say drawings) in order of
merit according to their best opinion. If 50 per cent of the
judges rate specimen A as better than specimen B, these two are
regarded as of equal value on the assumption that “equally
rated differences are equal unless they are always or never
noticed.” *

If 85 per cent of the judges rate specimen C as better than A,
then the difference in value between A and C is obtained by
finding the deviate for ½α = (.85 − .50) = .35. Thus in Fig. 62,
C has the scaled value x = 1.04, the unit being the standard
deviation with the origin at the mean. If the percentage of
judges rating C better than B is 83, then a new scaled value
x = .95 may be averaged with 1.04, etc.

Fig. 62. Illustrating the scaling of items for a product test

By calculating similar differences for all pairs of specimens a
series of scaled values is obtained. The origin may be taken at
an arbitrary point (such as −5σ), but is often selected at the
specimen which most judges consider worthless.


* This is known as the Cattell-Fullerton Theorem.
As a further illustration of the arithmetic, two items may be
added to the above series. Problem D is rated better than A
by 93 per cent of the judges, while problem E is regarded by
most as worthless. Assuming that 99 per cent of the judges
rate A better than E, the value of the latter becomes − 2.33.
The scaled values may now be written as follows:


                    VALUES OF SPECIMENS
ORIGIN         E         A, B        C         D
A           − 2.33        0         1.04      1.48
−5σ           2.67        5         6.04      6.48
E             0           2.33      3.37      3.81


The last row of numbers is probably most convenient to use,
but zero is then only a rough approximation to “just no ability.”

All these results were obtained by using A as the item of
comparison, but approximately the same values would have been
secured if all differences had been computed with reference to
problem E. In the final scale it is usually best to select only
those items which differ from one another by fairly large amounts
(say .5σ), because, in using the scale, finer differences cannot
be readily noted.
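
The scaling by judges' preferences, and the change of origin, can be sketched in a few lines. The sketch assumes `statistics.NormalDist` in place of the printed table and rounds each deviate to two places before shifting, as the text does; the names `deviate_for_preference` and `shift_origin` are hypothetical.

```python
from statistics import NormalDist

def deviate_for_preference(share_preferring):
    """Scale difference (in sigma) when `share_preferring` of the judges prefer one specimen."""
    return round(NormalDist().inv_cdf(share_preferring), 2)

def shift_origin(values, offset):
    """Re-express scaled values from a new origin `offset` sigma lower."""
    return {name: round(value + offset, 2) for name, value in values.items()}

# Values measured from specimen A (B is rated equal to A).
from_a = {
    "E": -deviate_for_preference(0.99),   # 99 per cent rate A better than E
    "A,B": 0.0,
    "C": deviate_for_preference(0.85),    # 85 per cent rate C better than A
    "D": deviate_for_preference(0.93),    # 93 per cent rate D better than A
}
print(from_a)
print(shift_origin(from_a, 5.0))     # origin 5 sigma below A
print(shift_origin(from_a, 2.33))    # origin at the worthless specimen E
```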


EXERCISES 


1. Find the probabilities of occurrences within the following ranges
for a normal curve. Use Holzinger’s Table XI.


RANGE                          PROBABILITY (Ans.)
− 2.5σ to − 1.5σ                    .0606
− 2.5σ to + 2.5σ                    .9876
+ 1.0σ to + 3.0σ                    .1574
+ 3.54σ to + 3.88σ                  .0001
− 0.62σ to + 2.79σ                  .7298


2. Allowing a range of 1.2σ for each of the five marks A, B, C, D,
and E, find the percentages of such marks under a normal
distribution. (3.46, 23.84, 45.14, 23.84, 3.46. Ans.)
3. Represent the following data on a normal scale and find the
means of the five categories:


GRADE OF SCHOOL WORK      PERCENTAGE      MEAN (Ans.)
A                              5           + 2.062σ
B                             21           + 1.054σ
C                             49           − 0.013σ
D                             18           − 1.019σ
E                              7           − 1.919σ


4. Eighty-eight per cent of a group of judges rate drawing A as
better than drawing B, while 75 per cent rate B as better than C.
Assuming that 99 per cent of the judges have rated C as better than
X, which has no merit whatsoever, obtain the values of the drawings
C, B, and A with respect to X.

(X = 0; C = 2.3263σ; B = 3.0008σ; A = 4.1758σ. Ans.)

5. In a large group of children, the percentage of those who solved
a given example, with five specified examples considered one at a
time, varied as follows: 94, 87, 61, 43, 11. Find the σ value of each
example, using as origin a point 5σ below the mean.

(3.4452, 3.8736, 4.7207, 5.1764, 6.2265. Ans.)

6. Find the percentage distribution of five marks, using a range of
1σ for each. (6.06, 24.17, 38.30, 24.17, 6.06. Ans.)


7. Verify the following results:


NUMBER OF       PERCENTAGE OF        PER CENT EXCEEDING,       T SCORE (Ans.)
QUESTIONS       PUPILS OBTAINING     PLUS HALF THOSE
CORRECT = Q     GIVEN Q              REACHING Q
0                    2                    99                      26.7
1                    6                    95                      33.6
2                   12                    86                      39.2
3                   18                    71                      44.5
4                   20                    52                      49.5
5                   14                    35                      53.9
6                   12                    22                      57.7
7                   10                    11                      62.3
8                    6                     3                      68.8


8. Calculate the ordinates for (.6 + .5)® and compare them with 
those of the normal curve. 


9. Fit normal curves to the distributions of I.Q.’s given in Table 55 
of Chapter XIII. Use columns 1, 2, 3, and 4. 


CHAPTER XIII 
SAMPLING AND RESPONSE ERRORS 


1. INTRODUCTORY 


All statistical quantities such as averages and measures of 
relationship are based upon samples. The results found from 
one sample will never quite agree with those found from another, 
nor with those from the whole population from which the samples 
were chosen. In determining the stability of a given measure or
in comparing the results from different groups it is therefore 
important to know the probable extent of such fluctuations. 

Thus a correlation of .80 may appear to indicate some rela- 
tionship between two traits, but if on taking another sample 
the coefficient is found to be .10, we can place little confidence 
in either of the two results. Some measure of the likely varia- 
tion from sample to sample is clearly desirable. 

Again, in the case of a control experiment, two means might 
be obtained for comparison, their difference being the test of the 
relative superiority of two methods of learning. For example, 
the mean gain might be 22 for a control group, and 20 for a 
practice group. The difference is 2, but whether or not it is 
of any significance remains to be shown. It might be that by 
repeating the experiment the difference would come out to be 
— 8 in favor of the other group. Here also a critical test of such 
differences under sampling is necessary. 

The stability of a statistical constant from sample to sample
is often called its reliability* and is measured by the use of
sampling formulas to be discussed in the present chapter. On
account of the rather elaborate mathematics involved only a
few of the proofs of these formulas will be given, but their use
and interpretation as applied to a variety of educational
problems will be treated at some length.


* This term should not be confused with the reliability coefficient $r_{1I}$ for a test.
It might therefore be better to use the expression “sampling reliability” for the
former.

Sampling formulas as applied to statistical data are usually 
approximations, their accuracy depending on certain assump- 
tions in the proofs and especially upon the number of cases 
involved. The chief danger in using such formulas without 
being familiar with the proofs may be avoided by never apply- 
ing them to a small number of cases (say less than thirty). 

In the last section of this chapter some of the current formulas 
for dealing with response errors will be presented. As noted 
in Chapter V, response errors are due to the variability of per- 
formance within the individual measured or tested. 


2. SAMPLING ERROR IN THE MEAN 


If the true mean of an indefinitely large number of
observations be denoted by M and their standard deviation by $\sigma_x$,
and if the mean of a randomly drawn sample of N individuals
be represented by $M_1$, the difference $M - M_1$ is known as the
sampling error in the mean. It can be shown theoretically that
if repeated samples of N be randomly drawn from the
population, the differences $M - M_1$ will be distributed around zero
with a standard deviation given by the formula *

$$ \sigma_M = \frac{\sigma_x}{\sqrt{N}} \qquad \text{(Standard error of the mean)} \tag{87} $$

If the size of the samples is large the distribution of $M - M_1$
tends to follow a normal curve even though the population
sampled is not normal.


* A good proof of this formula is given in Jones’s “First Course in Statistics,”
p. 158. G. Bell & Sons, Ltd., London, 1921. The reasonableness of the formula is at
once apparent from the fact that a small dispersion and a large number of cases
decrease the size of $\sigma_M$.


As an approximation to an indefinitely large number of cases
let us assume that we have 50,000 observations of a certain
variable with the mean equal to M, and that samples of 500 be
drawn. The means of these samples, which may be denoted by
$M_1, M_2, M_3, \ldots, M_{100}$, will be distributed about M in a
frequency curve resembling that which would have been found
had the number of samples been increased indefinitely. A
hypothetical distribution of such means is shown in Fig. 63.
The mean of all the samples is 148, and the standard
deviation, $\sigma_M$, is 1.71.

Fig. 63. Hypothetical distribution of means from one hundred samples
(horizontal scale of sample means from 143 to 154)

Now let us assume that one of the samples of 500 cases
furnishes a mean $M_1$ equal to 146, and a standard deviation
$\sigma_{x_1}$ equal to 37.12. By substituting these values in formula
(87) we then find $\sigma_M$ = 1.66. If the means and standard
deviations from other samples had been used in this formula,
very nearly the same results would have been obtained for
$\sigma_M$ because $\sigma_x$ will vary but slightly from sample to
sample, provided the size of the sample is large.

It thus appears that in dealing with only one sample the
mean of the whole population is unknown, but may be
approximated by $M_1$, and that the formula
$\sigma_{M_1} = \sigma_{x_1}/\sqrt{N}$ gives the best
obtainable approximation to the true standard deviation $\sigma_M$.

The probable error of the mean is given by the formula

$$ \text{P.E.}_M = .6745\,\frac{\sigma_x}{\sqrt{N}} \qquad \text{(Probable error of the mean)} \tag{88} $$
P. Eos = 8145 {the mean \ (88) 
If the true values for M and $\sigma_x$ were known it would then be
possible to find a range on the normal scale within which it is
almost certain that an observed mean $M_1$ must lie. In actual
practice, however, it is $M_1$ and not M that is known, so
that this argument must be reversed.

The theoretical curve in Fig. 64 represents the inverse
probability for various positions of the true mean when
$M_1$ is known. The value for $\text{P.E.}_{M_1}$ by formula (88) is
.6745 × 1.66 = 1.12. Since half the area of the curve
lies between $M_1$ − P.E. and $M_1$ + P.E., or between 144.88 and
147.12, the probability that the true mean lies between these
limits is .5, and the result is ordinarily written $M_1$ = 146 ± 1.12.
By similar argument we find that the chances are over 99 in
100 that the true mean will lie in the range $M_1$ ± 4 P.E., or
between 141.52 and 150.48 as shown in Table 52. This range
is the usually accepted zone of safety.

Fig. 64. Illustrating various ranges of probable error on a normal curve
(scale marked at 141.52, 142.64, 143.76, 144.88, 146, 147.12, 148.24, 149.36, 150.48)


TABLE 52. PROBABILITIES THAT THE TRUE MEAN WILL LIE WITHIN
A GIVEN RANGE


RANGE                                  PROBABILITY THAT M LIES WITHIN
                                       GIVEN RANGE
M1 ± 1 P.E.  (144.88–147.12)                  .500
M1 ± 2 P.E.  (143.76–148.24)                  .822
M1 ± 3 P.E.  (142.64–149.36)                  .957
M1 ± 4 P.E.  (141.52–150.48)                  .993
M1 ± 5 P.E.  (140.40–151.60)                  .999
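
The arithmetic for the standard error, the probable error, and the ranges about the observed mean may be sketched as follows. The function names are hypothetical helpers introduced only for illustration; the constants are those of the worked example (mean 146, standard deviation 37.12, 500 cases).

```python
from math import sqrt

PE_FACTOR = 0.6745  # probable error = .6745 x standard error

def standard_error_of_mean(sd, n):
    """Standard error of the mean: sd divided by the square root of n."""
    return sd / sqrt(n)

def probable_error_of_mean(sd, n):
    """Probable error of the mean."""
    return PE_FACTOR * standard_error_of_mean(sd, n)

def range_about_mean(mean, pe, multiple):
    """Limits mean +/- multiple x P.E., rounded to two places."""
    return (round(mean - multiple * pe, 2), round(mean + multiple * pe, 2))

se = standard_error_of_mean(37.12, 500)
pe = probable_error_of_mean(37.12, 500)
print(round(se, 2), round(pe, 2))
for k in range(1, 6):
    print(k, range_about_mean(146, round(pe, 2), k))
```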


The calculation of probable errors of the mean given by
formula (88) is facilitated by the use of tables giving the values
of $\chi_1 = .6745/\sqrt{N}$. The probable error is obtained by
multiplying the observed value of $\sigma_x$ by the tabled value of $\chi_1$. Thus,
for $\sigma_x$ = 13.1 and N = 147, we find from Holzinger’s Tables
for Students, Table IX, that $\chi_1$ = .0556. The value for $\text{P.E.}_M$
is therefore .0556 × 13.1 = .728.
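
Where the printed table is not at hand, the tabled factor can be computed directly. This is a minimal sketch; the function name `chi_1` is a hypothetical label for the quantity .6745/√N.

```python
from math import sqrt

def chi_1(n):
    """Factor .6745 / sqrt(N) by which the standard deviation is multiplied."""
    return 0.6745 / sqrt(n)

factor = round(chi_1(147), 4)       # value as it would be read from a four-place table
print(factor, round(factor * 13.1, 3))
```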


3. THE PROBABLE ERROR OF THE DIFFERENCE 
BETWEEN Two MEANS 


One of the most useful formulas in sampling is that for testing 
whether or not small differences may have arisen from chance. 
The formula may be employed with a variety of statistical 
measures, but is most frequently applied in the case of the mean. 

If the variables in two groups, and hence their means, are 
quite independent of one another the probable error of the dif- 
ference M, — Mz is given by the formula 


$$ \text{P.E.}_{M_1 - M_2} = \sqrt{(\text{P.E.}_{M_1})^{2} + (\text{P.E.}_{M_2})^{2}} \qquad \text{(Probable error of the difference between two uncorrelated means)} \tag{89} $$


The use of this formula may be illustrated in the case of a 
control experiment in the teaching of physics. Two groups of 
pupils were equated with respect to intelligence and initial 
ability in a type of high-school physics. After teaching one 
group by the lecture method and the other group by the dem- 
onstration method a final test was given and results found as 
shown in Table 53. 


TABLE 53. DATA FROM PHYSICS-TEACHING EXPERIMENT


                                              LECTURE GROUP        DEMONSTRATION GROUP
Population                                    N1 = …               N2 = 41
Mean intelligence score                       137                  138
Mean score on initial physics test            74.3                 74.3
Mean score on final physics test              M1 = 91.43           M2 = 89.64
Standard deviation for final physics test     σ1 = …               σ2 = 7.23
Probable error of mean                        P.E.M1 = .785        P.E.M2 = .761
The probable errors of the means are given by formula (88).
Substituting these values in formula (89), we find that

$$ \text{P.E.}_{M_1 - M_2} = \sqrt{(.785)^{2} + (.761)^{2}} = 1.09, $$

the arithmetic being quickly done with a table of squares.
The difference between final scores may now be written

$$ M_1 - M_2 = 91.43 - 89.64 = 1.79 \pm 1.09. $$

Such a difference is regarded as insignificant, or such that it is
not unlikely that the true difference is zero. This is illustrated
in Fig. 65. Speaking approximately, since the number
of observations is small, the probability that the true
difference lies outside a range of ±2 P.E., or −.39 to 3.97, is
about .18 by Table 54. The probability that the true
difference will be outside the range 0 to 3.58 may be had
from Table 54 by entering with a/P.E. = 1.79/1.09 = 1.64, the
result being approximately .27. The chances are, therefore,
approximately one in four that the true difference will be as
small as 0 or as large as 3.58.

Fig. 65. Illustrating the probability that an observed difference will be as
low as zero or as high as 3.58
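
The test of the difference can be sketched numerically. The following is an illustration only, assuming `statistics.NormalDist` in place of Table 54; the function names `pe_of_difference` and `chance_beyond` are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def pe_of_difference(pe_1, pe_2):
    """Probable error of the difference between two uncorrelated means."""
    return sqrt(pe_1 ** 2 + pe_2 ** 2)

def chance_beyond(ratio):
    """Chance of a deviation beyond +/- a, where ratio = a / P.E."""
    limit = 0.6745 * ratio                       # express the deviation in sigma units
    return 2 * (1 - NormalDist().cdf(limit))     # both tails of the normal curve

pe = round(pe_of_difference(0.785, 0.761), 2)
difference = 91.43 - 89.64
print(pe, round(difference / pe, 2))
print(round(chance_beyond(2.0), 2), round(chance_beyond(difference / pe), 2))
```

A ratio below 4 leaves an appreciable chance that the difference arose from sampling alone, which is the situation in the physics experiment.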

In view of the above test the whole study is to be regarded 
as inconclusive. We have no right to ascribe the observed dif- 
ference of 1.79 to the superiority of the lecture method when it 
can be readily accounted for by chance fluctuations in sam- 
pling. It should also be noted that there are a large number of 
variable factors to be controlled in such an experiment. These 
factors can never be perfectly controlled and will undoubtedly 
affect the final result to some extent. It is assumed that the 
errors in sampling are independent of these factors. 
TABLE 54. PROBABILITIES OF THE OCCURRENCE OF DEVIATIONS RELATIVE
TO THE SIZE OF THE PROBABLE ERROR


a/P.E.     PROBABILITY OF        a/P.E.     PROBABILITY OF
           A DEVIATION                      A DEVIATION
           BEYOND ± a                       BEYOND ± a
1.0           .5000              3.0           .0430
1.1           .4581              3.1           .0365
1.2           .4183              3.2           .0309
1.3           .3806              3.3           .0260
1.4           .3450              3.4           .0218
1.5           .3117              3.5           .0182
1.6           .2805              3.6           .0152
1.7           .2515              3.7           .0126
1.8           .2247              3.8           .0104
1.9           .2000              3.9           .0085
2.0           .1773              4.0           .0070
2.1           .1567              4.1           .0057
2.2           .1378              4.2           .0046
2.3           .1208              4.3           .0037
2.4           .1055              4.4           .0030
2.5           .0918              4.5           .0024
2.6           .0795              4.6           .0019
2.7           .0686              4.7           .0015
2.8           .0589              4.8           .0012
2.9           .0505              4.9           .0009


The general rule, already noted, is that a difference or a
statistical constant of any sort is not significant unless it is
at least four times its probable error.

Table 54 gives the probabilities for deviations greater than
+ a and less than − a for various values of a/P.E., that is,
the fraction of the area under a normal curve beyond these limits.


4. THE PROBABLE ERRORS OF CERTAIN CONSTANTS 
FOR A NORMAL DISTRIBUTION 


The probable error of the mean may be used for any form of 
distribution, but in the case of certain other constants, it is 
assumed in the proofs that the distribution is normal. The 
following formulas should therefore be used only in case the 
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observed distribution from which the constants are obtained 
approximates the normal probability curve. 
The probable error of the median is given by the formula 


P. Baya = ae = 12688 P Base {ee median} (0) 
Inasmuch as the sampling error in the median is about 25 per 
cent more than in the mean, the greater reliability of the latter 
is at once apparent. For certain very peaked (leptokurtic) dis- 
tributions the median may be more reliable,* but for the large 
majority of problems the distributions are roughly normal and 
the mean is to be preferred. 

The standard deviation is one of the most reliable of all sta-
tistical constants, its probable error being given by

$$\mathrm{P.E.}_{\sigma} = \frac{.6745\,\sigma}{\sqrt{2N}} = \frac{.4769\,\sigma}{\sqrt{N}} = .7071\ \mathrm{P.E.}_{M}.$$
{Probable error of the standard deviation}   (91)

In case $\mathrm{P.E.}_{M}$ is also required, the last form on the right is
probably the most convenient for computation.
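The coefficient of variation, $V$, has for its probable error

$$\mathrm{P.E.}_{V} = \frac{.6745\,V}{\sqrt{2N}}\left[1 + 2\left(\frac{V}{100}\right)^{2}\right]^{\frac{1}{2}}.$$
{Probable error of the coefficient of variation}   (92)

The calculation is facilitated by the use of Pearson's Tables V
and VI, which give the values of $\chi_2$ and $\psi$. The formula may
then be written

$$\mathrm{P.E.}_{V} = \left[\frac{.6745}{\sqrt{2N}}\right]\left[V\left\{1 + 2\left(\frac{V}{100}\right)^{2}\right\}^{\frac{1}{2}}\right] = \chi_2\,\psi.$$
{For use with Pearson's Tables}   (93)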


The probable error of the correlation coefficient is

$$\mathrm{P.E.}_{r} = \frac{.6745\,(1 - r^{2})}{\sqrt{N}}.$$
{Probable error† of the correlation coefficient}   (94)


* Yule, Introduction to Statistics, p. 338. 

† The student is warned that this formula should not be applied when N is small
and at the same time r is large. Misleading results may follow for such cases as
N = 20 and r = .5, N = 50 and r = .8, or N = 100 and r = .9.

Complete tables* for this error have been worked out by the
writer for every value of N from 20 to 100, and by tens there-
after up to 1000. A shorter table is also found in Table X of
Holzinger's Statistical Tables for Students.
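Where the printed tables are not at hand, formulas (90), (91), and (94) can be evaluated directly. The sketch below is illustrative only: the function names are assumptions, and the "four times its probable error" check simply restates the general rule given earlier in the chapter. Like the formulas themselves, it presumes an approximately normal distribution.

```python
import math

PE = 0.6745  # probable error of a normal deviate, in units of sigma


def pe_mean(sigma, n):
    """Probable error of the mean: .6745 * sigma / sqrt(N)."""
    return PE * sigma / math.sqrt(n)


def pe_median(sigma, n):
    """Formula (90): 1.2533 times the probable error of the mean."""
    return 1.2533 * pe_mean(sigma, n)


def pe_sd(sigma, n):
    """Formula (91): .6745 * sigma / sqrt(2N)."""
    return PE * sigma / math.sqrt(2 * n)


def pe_r(r, n):
    """Formula (94): .6745 * (1 - r^2) / sqrt(N)."""
    return PE * (1 - r * r) / math.sqrt(n)


def is_significant(value, pe, multiple=4.0):
    """General rule: significant only if at least `multiple` times its P.E."""
    return abs(value) >= multiple * pe


print(round(pe_r(0.162, 87), 3))  # first correlation of Exercise 2
```

The first correlation of Exercise 2 (r = .162, N = 87) gives a probable error of about .070, so that correlation is well short of four times its probable error.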

An approximate value for the probable error of the correla-
tion ratio is

$$\mathrm{P.E.}_{\eta} = \frac{.6745\,(1 - \eta^{2})}{\sqrt{N}},$$
{Probable error of the correlation ratio}   (95)


so that the above tables may also be used for this measure of 
association. 

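In the case of the regression coefficients $b_{12} = r\frac{\sigma_1}{\sigma_2}$ and $b_{21} = r\frac{\sigma_2}{\sigma_1}$,
the probable errors are

$$\mathrm{P.E.}_{b_{12}} = .6745\,\frac{\sigma_{1.2}}{\sigma_2\sqrt{N}}$$   (96a)

and

$$\mathrm{P.E.}_{b_{21}} = .6745\,\frac{\sigma_{2.1}}{\sigma_1\sqrt{N}}.$$   (96b)

{Probable errors of regression coefficients}
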
Similar formulas are applied in the case of partial regression
coefficients (see Chapter XV); that is,

$$\mathrm{P.E.}_{b_{12.k}} = .6745\,\frac{\sigma_{1.2k}}{\sigma_{2.k}\sqrt{N}},$$
{Probable error of higher-order regression coefficient}   (97)


k being any collection of secondary subscripts other than 1 or 2. 
These last formulas should not be confused with formulas (45) 
and (46), which give the probable errors of estimate of a single 
score by the lines of regression. 

In testing for linearity of regression, the probable error of
$\zeta = \eta^2 - r^2$ has already been used. The formula is

$$\mathrm{P.E.}_{\zeta} = .6745\,\frac{2\sqrt{\zeta}}{\sqrt{N}}\sqrt{(1 - \eta^2)^2 - (1 - r^2)^2 + 1}.$$
{Probable error of $\eta^2 - r^2$}   (98)

If $\eta^2 - r^2$ is to be less than three times its probable error, the
above expression reduces to formula (67) of Chapter X.


* Karl J. Holzinger, Tables of the Probable Error of the Correlation Coefficient,
Tracts for Computers No. XII, p. 35. Cambridge University Press, England, 1925.


5. SOME APPLICATIONS OF PROBABLE ERROR FORMULAS


One important use of the sampling theory is to determine
whether or not two or more samples belong to the same or to
different types of populations. This may be illustrated in the
case of the distribution of 4834 intelligence quotients given in
Table 20. The total distribution may be broken up into the
sub-groups given in Table 55.

From the means and standard deviations at the bottom of the
table, we may now test the difference between various groups
designated from 1 to 6. If A and B are any two independent
measures, formula (89) becomes

$$\mathrm{P.E.}_{A-B} = \sqrt{(\mathrm{P.E.}_A)^2 + (\mathrm{P.E.}_B)^2}.$$

Using this formula together with (88) and (91) we find:

$$M_1 - M_2 = -5.52 \pm \sqrt{(.27)^2 + (.34)^2} = -5.52 \pm .43$$
and
$$\sigma_1 - \sigma_2 = 5.98 \pm \sqrt{(.19)^2 + (.24)^2} = 5.98 \pm .31,$$


both differences being clearly significant. The grade and high 
school city children are thus to be regarded as distinctly dif- 
ferent intellectual types, the differences being probably due to 
selection. 
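The two probable errors just found can be verified with a few lines of arithmetic. This is only a sketch of the calculation; the function names are illustrative.

```python
import math


def pe_difference(pe_a, pe_b):
    """Probable error of A - B when A and B are independent measures."""
    return math.hypot(pe_a, pe_b)  # sqrt(pe_a**2 + pe_b**2)


def times_its_pe(difference, pe):
    """How many probable errors the observed difference amounts to."""
    return abs(difference) / pe


pe_means = pe_difference(0.27, 0.34)  # groups 1 and 2, means
pe_sds = pe_difference(0.19, 0.24)    # groups 1 and 2, standard deviations

print(round(pe_means, 2), round(times_its_pe(-5.52, pe_means), 1))
print(round(pe_sds, 2), round(times_its_pe(5.98, pe_sds), 1))
```

Both differences come out at well over four times their probable errors, in agreement with the conclusion drawn above.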

By similar calculations we obtain:

$$M_1 - M_3 = 9.34 \pm .37, \qquad \sigma_1 - \sigma_3 = 1.07 \pm .26,$$
$$M_2 - M_3 = 14.86 \pm .42, \qquad \sigma_2 - \sigma_3 = 4.91 \pm .31.$$


Since all these differences are significant, the three white groups 
are to be considered as samples from essentially different types 
of populations. 

The means for the two negro groups are found to be signifi-
cantly lower than those for any of the white groups. The differ-
ence $M_4 - M_5$ $(2.64 \pm .78)$ does not prove to be significant by
the usual test. From Table 54, however, it will be found that
the odds are about 45 to 1 that city and country negroes are
to be regarded as distinct intellectual types.
It is evident from the above comparisons that all five groups
making up the total are to be regarded as samples from quite
distinct population types. This lack of homogeneity doubtless
accounts in part for the fact that group 6 does not furnish a
good example of a normal curve.

Another application of the formula

$$\mathrm{P.E.}_{A-B} = \sqrt{(\mathrm{P.E.}_A)^2 + (\mathrm{P.E.}_B)^2}$$

may be made in the comparison of correlation coefficients. In
the same number of the Journal of Educational Psychology two
writers* presented correlations between mental ages on the Binet
and the Herring intelligence tests. Dr. Herring gives the value
$r = .987 \pm .002$, obtained from 116 twelve-year-old children, and
Dr. Avery finds as his highest correlation, $r = .824 \pm .031$, from
a group of 48 first-grade children. These two correlations are
independent, since they were obtained from different groups.
The difference by the above formula is then $.163 \pm .031$, which
is more than five times its probable error, and therefore signifi-
cant. A probable explanation† of the difference between these
correlations lies in the fact that one of the tests is much more
reliable than the other when applied to very young children.

In case the measures A and B are correlated the formula for
testing the significance of the difference A − B becomes

$$\mathrm{P.E.}_{A-B} = \sqrt{(\mathrm{P.E.}_A)^2 + (\mathrm{P.E.}_B)^2 - 2R_{AB}(\mathrm{P.E.}_A)(\mathrm{P.E.}_B)},$$
{Probable error of difference with correlated measures}   (99)

where $R_{AB}$ is the correlation between the sampling errors in
A and B.

For two means $M_1$ and $M_2$ from correlated material, the
correlation between the sampled means, $R_{M_1M_2}$, is equal to $r_{12}$,
which is the correlation between the observed variables, so that


* John P. Herring, “Reliability of the Stanford and the Herring Revision of the 
Binet-Simon Tests,” and A. T. Avery, ‘“Comparison of Stanford and Herring 
Revisions Given to First-Grade Children,” Journal of Educational Psychology, 
April, 1924. 

† It is also possible that formula (94) does not apply when r = .987 and N = 116.

$$\mathrm{P.E.}_{M_1-M_2} = \sqrt{(\mathrm{P.E.}_{M_1})^2 + (\mathrm{P.E.}_{M_2})^2 - 2r_{12}\,\mathrm{P.E.}_{M_1}\,\mathrm{P.E.}_{M_2}}.$$
{Probable error of difference between means where correlated}   (100)


This formula may be illustrated by a comparison of the length of
the left forearm for 1063 English males and their adult sons.*
The results found were

$$M_S = 18.52'' \pm 0.021'', \quad\text{and}\quad M_F = 18.31'' \pm 0.019'',$$

while $r_{FS}$ was equal to .421, the size of forearm in father and son
showing considerable correlation. Substituting in formula (100),
we find that

$$\mathrm{P.E.}_{M_S-M_F} = \sqrt{(.021)^2 + (.019)^2 - 2(.421)(.021)(.019)} = .022.$$

The difference may then be written $0.21'' \pm .022$. Since this is
about nine times its probable error, there is no doubt that the
sons of the professional English class were substantially differ-
entiated from their fathers by a slightly longer forearm.
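The forearm example can be reproduced as follows. The sketch assumes nothing beyond formula (100); the function name is illustrative.

```python
import math


def pe_difference_correlated(pe_a, pe_b, r_ab):
    """Formula (99)/(100): P.E. of A - B when the sampling errors of A and B
    are correlated to the extent r_ab."""
    return math.sqrt(pe_a ** 2 + pe_b ** 2 - 2.0 * r_ab * pe_a * pe_b)


pe_sons, pe_fathers, r_fs = 0.021, 0.019, 0.421
pe_diff = pe_difference_correlated(pe_sons, pe_fathers, r_fs)
difference = 18.52 - 18.31

print(round(pe_diff, 3))               # probable error of the difference
print(round(difference / pe_diff, 1))  # difference in units of its P.E.
```

Setting `r_ab` to zero recovers the formula for independent measures, which would give a somewhat larger probable error (about .028) for the same data.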


6. THE PROBABLE ERRORS OF OBSERVED AND 
PERCENTAGE FREQUENCIES 


In comparing the frequencies between two groups it is often 
convenient to reduce them to percentages as in the table on 
page 244 taken from columns 1 and 2 of Table 55. 

If $f$ denotes an observed frequency, its probable error is given
by the formula

$$\mathrm{P.E.}_{f} = .6745\sqrt{f\left(1 - \frac{f}{N}\right)},$$
{Probable error of an observed frequency†}   (101)

while for a percentage frequency $f_p = \frac{100f}{N}$ we have

$$\mathrm{P.E.}_{f_p} = .6745\sqrt{\frac{f_p\,(100 - f_p)}{N}}.$$
{Probable error of a percentage frequency‡}   (102)


* Biometrika, Vol. II, p. 370.

† This formula may be derived from equation (105) by setting $p = \frac{f}{N}$ and
$q = \left(1 - \frac{f}{N}\right)$. For a complete and excellent proof see Jones, op. cit., p. 151.

‡ Derived from formula (106) by finding the P.E. of $100p$, or $f_p$.

TABLE 56. FREQUENCY PERCENTAGES OF I.Q.'S FOR GRADE AND
HIGH SCHOOL WHITE CHILDREN


I.Q.                              FREQUENCY PERCENTAGES
                              Grade Schools    High Schools

L50=LGOs ewe, sess 6) pene cor uses 0.1 — 
AO = () Santee Me tt its ego oe <pots aise 0.6 = 
US ORT AOR 2 ee cots Dey ee ala oes esr 0.3 
P20=130 gee cc ce aire Pl) Gee ee kee AT 3.1 
U1LOEI20 Se; Ace £0 4 eta fees 13 17.5 
LOQS1 LOR cara v2 cghako es aus ees 22.5 ’ 39.3 
9O=LOO Se cut 28 cohen Gennes 26.5 28.5 
S090 Bec (Al wk ele een ie 17.9 10.8 
LO=SO0 =a) arc micd sien te rane 5 le 4 0.5 
60270 oon. eh See oes 2.6 a 
50=605~ 2 art po oe cee eee 0.6 —_ 
AQ= HQ mths Lc) 24a ho ees oe co lcaine 0.2 —_ 
SO=AO= Woo, veep ee eeoteea) eee aes 0.1 — 
TL OtaL ane 1c, er eee 100.0 100.0 


Applying formula (102) to the percentage frequencies in the in-
terval 100 to 110, we find

$$39.3 \pm .6745\sqrt{\frac{39.3\,(60.7)}{N_2}}, \quad\text{or}\quad 39.3 \pm 1.67,$$

and

$$22.5 \pm .6745\sqrt{\frac{22.5\,(77.5)}{N_1}}, \quad\text{or}\quad 22.5 \pm 0.71,$$

where $N_1$ and $N_2$ are the numbers of grade-school and high-school
children, respectively, in Table 55. Hence

$$\mathrm{P.E.(diff.)} = \sqrt{(1.67)^2 + (0.71)^2} = 1.81.$$

The difference 39.3 − 22.5 may therefore be written $16.8 \pm 1.81$.
We may conclude that a significantly higher percentage of
high-school pupils is found in the group with I.Q.'s between
100 and 110.

Formula (101) is often useful in comparing observed with
theoretical frequencies. Thus in Fig. 55 the area under the
normal curve from 80 to 90 is larger than that given by the
column of the histogram. In order to find the area under
the curve it is necessary to express the class limits as deviates
from the mean and enter a table of areas such as Holzinger's
Table XI. The arithmetic will then be as follows:

$$\frac{x}{\sigma} = \frac{80 - 89.28}{16.61} = -.559, \qquad \frac{x}{\sigma} = \frac{90 - 89.28}{16.61} = .043,$$

$$\tfrac{1}{2}\alpha = .2120,^{*} \qquad \tfrac{1}{2}\alpha = .0172.^{*}$$


Therefore, the normal frequency is 4834 (.2120 + .0172), or 1108.

From formula (101) the probable error of the observed
frequency 1059 is $.6745\sqrt{1059 \times .7809} = 19$. The difference
$1108 - 1059 = 49 \pm 19$ might therefore be attributable to the
fluctuations of sampling.
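Formulas (101) and (102) lend themselves to a short computational check. The sketch below uses the observed frequency 1059 in a total of 4834 from the example above; the function names are illustrative and not part of the text.

```python
import math

PE = 0.6745


def pe_frequency(f, n):
    """Formula (101): probable error of an observed frequency f out of N."""
    return PE * math.sqrt(f * (1.0 - f / n))


def pe_percentage(f_p, n):
    """Formula (102): probable error of a percentage frequency f_p = 100 f / N."""
    return PE * math.sqrt(f_p * (100.0 - f_p) / n)


f, n = 1059, 4834
pe_f = pe_frequency(f, n)
print(round(pe_f, 1))                      # probable error of the frequency 1059
print(round((1108 - f) / pe_f, 1))         # observed gap of 49 in units of its P.E.
print(round(pe_percentage(100.0 * f / n, n), 2))
```

The gap of 49 amounts to only about two and a half probable errors, which is why the text treats it as possibly due to sampling. The two formulas are consistent with each other: the probable error of the percentage is simply 100/N times that of the frequency.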


7. THE CHI-SQUARE TEST 


In the case of a whole frequency distribution such as for the 
4834 I. Q.’s, a comparison of the observed and theoretical fre- 
quencies may be made by Pearson’s Chi-Square Test. Any 
such distribution is to be regarded as a sample from a much 
larger group. The problem is then to determine whether or not 
the fitted curve is a sufficiently good description of the observed 
data within the fluctuations of sampling. 

The test is made by obtaining all the differences between 
observed and theoretical frequencies, substituting the result in a 
formula, and determining by a table the probability that ran- 
dom sampling would give as bad a fit or worse. 

If the observed frequencies are denoted by

$$f'_1,\ f'_2,\ f'_3,\ \ldots,\ f'_n$$

and the corresponding theoretical frequencies by

$$f_1,\ f_2,\ f_3,\ \ldots,\ f_n,$$

the value for $\chi^2$ may be written

$$\chi^2 = \sum\left\{\frac{(f'_t - f_t)^2}{f_t}\right\}.$$
{The chi-square function}   (103)


* These values have been obtained from Table XI by linear interpolation, that
is, when $\frac{x}{\sigma} = .55$, $\tfrac{1}{2}\alpha = .2088$ and when $\frac{x}{\sigma} = .56$, $\tfrac{1}{2}\alpha = .2123$. The value of $\tfrac{1}{2}\alpha$ for
$\frac{x}{\sigma} = .559$ is therefore .9 of the difference .0035 + .2088, or .2120.


246 STATISTICAL METHODS IN EDUCATION 


Before taking up the probability test we shall next work out $\chi^2$
for the distribution of 4834 I.Q.'s fitted by a normal curve.

In determining the values of $f_t$ it will be found convenient
to obtain the fractional area from the mean to the limits of the
various groups, and then subtract these values successively and
multiply by N to give the theoretical frequencies comparable
with $f'_t$. The complete arithmetic for the values of $f_t$ is shown
in the following table:


TABLE 57. SHOWING THE CALCULATION OF NORMAL FREQUENCIES


GROUP LIMITS     X - M         x/σ        AREA FROM       AREA            f_t
     X                     (σ = 16.61)    MEAN TO X*   X(t-1) TO X(t)   (AREA × 4834)

    160         + 70.72      + 4.26         .5000         .0001             0.5
    150         + 60.72      + 3.66         .4999         .0010             4.8
    140         + 50.72      + 3.05         .4989         .0060            29.0
    130         + 40.72      + 2.45         .4929         .0251           121.3
    120         + 30.72      + 1.85         .4678         .0734           354.8
    110         + 20.72      + 1.25         .3944         .1522           735.7
    100         + 10.72      + 0.65         .2422         .2262          1093.5
     90         +  0.72      + 0.04         .0160         .2283†         1103.6
     80         -  9.28      - 0.56         .2123         .1647           796.2
     70         - 19.28      - 1.16         .3770         .0838           405.1
     60         - 29.28      - 1.76         .4608         .0301           145.5
     50         - 39.28      - 2.36         .4909         .0076            36.7
     40         - 49.28      - 2.97         .4985         .0013             6.3
     30         - 59.28      - 3.57         .4998         .0002‡            1.0
   Totals                                                1.0000          4834.0


The only point where any difficulty is likely to arise is in passing 
over the group containing the mean. Here the frequencies must 
be added to obtain the frequency of the group. A rough diagram 
of the normal curve will clarify the whole procedure. 

In working out x? Professor Pearson recommends the con- 
solidation of the small frequencies in the end groups. For the 
top interval we shall therefore add 0.5 and 4.8. The excess of 
1.0 below 30 may also be added to the lowest group to give 7.3. 
Table 58 then shows the remainder of the calculation. 


* By Holzinger's Table XI.   † (.2123 + .0160 = .2283.)   ‡ .0002 is below 30.
The number of frequency groups is denoted by $n'$. Enter-
ing Pearson's Table XII with $n' = 12$ and $\chi^2 = 46.6$, we find
$P = .00001$. The interpretation of this result is that, if the real
distribution were represented by the normal curve fitted above,
random sampling would give a fit as bad as or worse than the
one observed only once in 100,000 trials. The
actual fit is therefore a very bad one. Unless the value of $P$
be .2 or more, the fit cannot be regarded as good and other
curves should be tried.

The importance of the $\chi^2$ test arises from the fact that it
furnishes a rigorous method for determining goodness of fit.


TABLE 58. SHOWING THE CALCULATION OF χ²


              OBSERVED     THEORETICAL
  CLASS       FREQUENCY    FREQUENCY     f't - f_t    (f't - f_t)²    (f't - f_t)² / f_t
                 f't          f_t

 140-160          14           5.3        +  8.7          75.69             14.3
 130-140          36          29.0        +  7.0          49.00              1.7
 120-130         103         121.3        - 18.3         334.89              2.8
 110-120         318         354.8        - 36.8        1354.24              3.8
 100-110         799         735.7        + 63.3        4006.89              5.4
  90-100        1074        1093.5        - 19.5         380.25              0.3
  80-90         1059        1103.6        - 44.6        1989.16              1.8
  70-80          868         796.2        + 71.8        5155.24              6.5
  60-70          366         405.1        - 39.1        1528.81              3.8
  50-60          163         145.5        + 17.5         306.25              2.1
  40-50           25          36.7        - 11.7         136.89              3.7
  30-40            9           7.3        +  1.7           2.89              0.4
 Totals         4834        4834.0          00.0                            46.6


Mere inspection of the data is of no value except to suggest the
theoretical form of the curve to be fitted. When this has been
selected by guess (or by the method of Chapter XVI) the fit
should be tested by a procedure similar to that shown above.
Other uses of the $\chi^2$ function will be given in Chapter XIV.
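The arithmetic of formula (103) for the twelve consolidated classes may be sketched as follows. The observed and theoretical frequencies are those of the I.Q. distribution worked out above, listed from the highest class to the lowest; the variable and function names are illustrative.

```python
# Observed and theoretical frequencies after consolidating the end groups
observed = [14, 36, 103, 318, 799, 1074, 1059, 868, 366, 163, 25, 9]
theoretical = [5.3, 29.0, 121.3, 354.8, 735.7, 1093.5,
               1103.6, 796.2, 405.1, 145.5, 36.7, 7.3]


def chi_square(observed, theoretical):
    """Formula (103): sum over classes of (observed - theoretical)^2 / theoretical."""
    return sum((o - t) ** 2 / t for o, t in zip(observed, theoretical))


chi2 = chi_square(observed, theoretical)
contributions = [round((o - t) ** 2 / t, 1) for o, t in zip(observed, theoretical)]

print(contributions)   # class-by-class terms
print(round(chi2, 1))  # their total
```

The largest single contribution comes from the consolidated top class, where 14 cases were observed against about 5 expected.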

A very much abbreviated table for the values of $P$ is given
in Table 59 on page 248 for use when $\chi^2$ and $n'$ are not large.
This table has been taken from Pearson's Table XII, the com-
putation of which was done by Mr. W. P. Elderton.
TABLE 59. VALUES OF P FOR TESTING GOODNESS OF FIT


  χ²     n' = 7      8       9      10      11      12      13      14      15

   1      .986    .995    .998    .999   1.000   1.000   1.000   1.000   1.000
   2      .920    .960    .981    .991    .996    .998    .999   1.000   1.000
   3      .809    .885    .934    .964    .981    .991    .996    .998    .999
   4      .677    .780    .857    .911    .947    .970    .983    .991    .995
   5      .544    .660    .758    .834    .891    .931    .958    .975    .986
   6      .423    .540    .647    .740    .815    .873    .916    .946    .966
   7      .321    .429    .537    .637    .725    .799    .858    .902    .935
   8      .238    .333    .433    .534    .629    .713    .785    .844    .889
   9      .174    .253    .342    .437    .532    .622    .703    .773    .831
  10      .125    .189    .265    .350    .440    .530    .616    .694    .762
  11      .088    .139    .202    .276    .358    .443    .529    .611    .686
  12      .062    .101    .151    .213    .285    .363    .446    .528    .606
  13      .043    .072    .112    .163    .224    .293    .369    .448    .527
  14      .030    .051    .082    .122    .173    .233    .301    .374    .450
  15      .020    .036    .059    .091    .132    .182    .241    .307    .378

8. THE PROBABLE ERROR OF AN OBSERVED PROPORTION 


It has already been shown in Chapter XI that the mean and
standard deviation of the point binomial $(q + p)^n$ are given by
$np$ and $\sqrt{npq}$, respectively. In the case where we are dealing with
$K$ samples of $n$ events each, the binomial becomes $K(q + p)^n$, for
which the mean and standard deviation are the same as before.

Now if the proportion of successes instead of the actual num-
ber is recorded, it will be necessary to take one nth of the
number in each sample. The mean proportion of successes will
then approach $p$ and the standard deviation will be given by

$$\sigma_p = \sqrt{\frac{pq}{n}}.$$
{Standard error of a proportion}   (104)

The equations for probable errors of the mean number and of
the proportion of successes in a sample are therefore

$$\mathrm{P.E.}_{np} = .6745\sqrt{npq},$$   (105)

and

$$\mathrm{P.E.}_{p} = .6745\sqrt{\frac{pq}{n}},$$   (106)

{Probable errors of the mean and of the proportion of successes}

respectively.
This last formula may be illustrated by the use of some data
taken from the 1920-1921 Register of The University of Chicago.
The total number of students for that year may be tabulated
in the following form:


                                          MEN       WOMEN      TOTAL
Graduate schools (group 1) . . . . .     1,433      1,246      2,679 (n1)
Undergraduate schools (group 2)  . .     3,938      4,768      8,706 (n2)
Total  . . . . . . . . . . . . . . .     5,371      6,014     11,385

The problem is to determine whether or not the proportion of
men in the graduate schools is significantly larger than in the
undergraduate schools. In this case a “success” is given by the
registration of a man and a “failure” by the registration of a
woman, while the total for each is the size of the sample, $n$.
The observed proportion of men in the graduate schools is
$\frac{1433}{2679} = .535 = p_1$, while the proportion of men undergraduates
is $\frac{3938}{8706} = .452 = p_2$. It is also evident that $q_1 = .465$, $n_1 = 2679$,
$q_2 = .548$, and $n_2 = 8706$. From formula (106) we therefore have

$$p_1 = .535 \pm .6745\sqrt{\frac{(.535)(.465)}{2679}} = .535 \pm .0065,$$

and

$$p_2 = .452 \pm .6745\sqrt{\frac{(.452)(.548)}{8706}} = .452 \pm .0036.$$

The difference between the two proportions may therefore be
written

$$p_1 - p_2 = .083 \pm \sqrt{(.0065)^2 + (.0036)^2} = .083 \pm .0074.$$


Assuming that the observed proportions are typical of other
years, or that the above data furnish random samples, we may
conclude that the graduate schools enroll a significantly larger
proportion of men graduates. It should be noted, however, that
the conditions brought about by the war might invalidate such
assumptions. The safest procedure, therefore, would be to cal-
culate the differences for a number of years.
Another method of approach to the above problem is to de-
termine whether or not the difference between the two propor-
tions could have arisen merely from the fluctuations in sampling
in case the two groups are regarded as samples from the same
or very similar populations.

The proportion of men in both schools is given by $p_0 = \frac{5371}{11385}$
$= .472$, with $q_0 = .528$. The equations for the probable errors
of the proportions in the two samples will then be

$$\mathrm{P.E.}_{p_1} = .6745\sqrt{\frac{p_0 q_0}{n_1}},$$   (107a)

and

$$\mathrm{P.E.}_{p_2} = .6745\sqrt{\frac{p_0 q_0}{n_2}}.$$   (107b)

{Probable errors of proportions of successes, based on both groups}

Applying these formulas to the above data, we have

$$\mathrm{P.E.}_{p_1} = .6745\sqrt{\frac{(.472)(.528)}{2679}} = .0065,$$

and

$$\mathrm{P.E.}_{p_2} = .6745\sqrt{\frac{(.472)(.528)}{8706}} = .0036,$$


agreeing to four places with the results found by formula (106).
The difference test, of course, gives $p_1 - p_2 = .083 \pm .0074$ as
before, and we may therefore safely conclude that random
sampling could not have accounted for the difference between
the observed proportions. The difference between the values
given by formulas (106) and (107) is chiefly a theoretical one,
for they do not differ largely unless $p_1$ and $p_2$ differ largely.
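Both approaches to the University of Chicago data can be followed through in a few lines. The sketch below uses the counts of men and totals from the 1920-1921 table; the variable and function names are illustrative.

```python
import math

PE = 0.6745


def pe_proportion(p, n):
    """Formulas (106)/(107): probable error of a proportion p from n cases."""
    return PE * math.sqrt(p * (1.0 - p) / n)


men_grad, n_grad = 1433, 2679
men_under, n_under = 3938, 8706

# First method: each group's own proportion, formula (106)
p1, p2 = men_grad / n_grad, men_under / n_under
pe1, pe2 = pe_proportion(p1, n_grad), pe_proportion(p2, n_under)
pe_diff = math.hypot(pe1, pe2)

# Second method: pooled proportion for both groups, formula (107)
p0 = (men_grad + men_under) / (n_grad + n_under)
pe1_pooled, pe2_pooled = pe_proportion(p0, n_grad), pe_proportion(p0, n_under)

print(round(p1 - p2, 3), round(pe_diff, 4))
print(round(pe1_pooled, 4), round(pe2_pooled, 4))
```

With these data the two methods give the same probable errors to four places, and the difference of .083 is more than ten times its probable error.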


9. RESPONSE ERROR FORMULAS 


A number of formulas for dealing with the response error
described in Chapter V will next be obtained. The notation to
be employed may be given as follows:


$z_1$ and $z_I$ = standard scores on two forms of $X_1$,
$z_2$ and $z_{II}$ = standard scores on two forms of $X_2$,
$r_{1I}$ and $r_{2II}$ = reliability coefficients of $X_1$ and $X_2$, respectively,
$e_1$ and $e_I$ = response errors in $X_1$ by two forms,
$e_2$ and $e_{II}$ = response errors in $X_2$ by two forms, and
$s$ and $t$ = average or “true” scores of an individual on $X_1$
and $X_2$, respectively.


We may therefore write

$$z_1 = s + e_1,$$
$$z_I = s + e_I,$$
$$z_2 = t + e_2,$$
$$z_{II} = t + e_{II}.$$
{Standard scores in terms of “true” scores and response error}   (108)


It will be assumed in the following proofs that the response 
errors, e, are not correlated with each other nor with the true 
scores, s and ¢. While this assumption is a reasonable one, it is 
not necessarily valid and the resulting formulas should be used 
with caution pending a verification of these assumptions. 

If two forms of a test are given we may write

$$z_1 - z_I = e_1 - e_I.$$

Squaring, summing for a group of individuals, and dividing by
N, there results

$$\frac{\sum z_1^2}{N} - \frac{2\sum z_1 z_I}{N} + \frac{\sum z_I^2}{N} = \frac{\sum e_1^2}{N} - \frac{2\sum e_1 e_I}{N} + \frac{\sum e_I^2}{N},$$

or

$$2 - 2r_{1I} = 2\sigma_{e_1}^2,$$

since $\sigma_z = 1$, $\sigma_{e_1} = \sigma_{e_I}$, and $r_{e_1 e_I} = 0$.
The required response error formula therefore becomes

$$\sigma_{e_1} = \sqrt{1 - r_{1I}},$$
{Standard error of response using z scores}   (109)

or, if the original scores X are used,

$$\sigma_{e_1} = \sigma_{X_1}\sqrt{1 - r_{1I}}.$$
{Standard error of response using X scores}   (110)


These formulas give the standard deviation of the N errors $e_1$
for a group and thus furnish an approximation to the standard
deviation of many similar errors for a single individual. They
measure, therefore, the standard error of response within an
average individual of the group. In case the probable error is
used as a unit we have

$$\mathrm{P.E.}_{e_1}\ (\text{of individual } X_1) = .6745\,\sigma_{X_1}\sqrt{1 - r_{1I}}.$$
{Probable error of response for $X_1$}   (111)


As an example, consider a test with a reliability coefficient of
.64 and standard deviation of 5. Substituting these values in
(111), we find that $\mathrm{P.E.}_{e_1} = .6745 \times 5 \times .6 = 2.02$. A pupil's
score, such as 31, may therefore be written $31 \pm 2$ with the
interpretation that it is an even chance that his true score,
assuming that there is no practice effect, will lie anywhere
between 29 and 33. To facilitate calculations of this sort,
values of $\sqrt{1 - r}$ have been prepared and tabled in Holzinger's
Tables for Students, No. VIII.
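The example just given may be checked as follows. This is a minimal sketch of formula (111) only; the function name and the printed score band are illustrative.

```python
import math

PE = 0.6745


def pe_response(sd, reliability):
    """Formula (111): probable error of response for a single score."""
    return PE * sd * math.sqrt(1.0 - reliability)


pe = pe_response(sd=5, reliability=0.64)
score = 31

print(round(pe, 2))                          # probable error of the score
print(round(score - pe), round(score + pe))  # even-chance band for the true score
```

Note that a perfectly reliable test (reliability 1) gives a probable error of zero, while a test with zero reliability gives .6745 times the full standard deviation.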

Returning to equations (108) and (109), we may next find the
response error of the difference $z_1 - z_2$ between two tests which
may be quite dissimilar. The quantity required is $\sigma_{(z_1 - z_2)}$ or
$\sigma_{(e_1 - e_2)}$. Since the errors $e$ are all uncorrelated

$$\sigma_{(e_1 - e_2)}^2 = \frac{\sum e_1^2}{N} + \frac{\sum e_2^2}{N} - \frac{2\sum e_1 e_2}{N} = \sigma_{e_1}^2 + \sigma_{e_2}^2.$$

From (109) the error $\sigma_{e_1} = \sqrt{1 - r_{1I}}$, and $\sigma_{e_2} = \sqrt{1 - r_{2II}}$ by
similar proof. Substituting these values in the above equation,
and taking the square root of both members, we find that

$$\sigma_{(e_1 - e_2)} = \sqrt{2 - r_{1I} - r_{2II}},^{*}$$
{Standard error of response for $z_1 - z_2$}   (112)

or

$$\mathrm{P.E.}\ (\text{of individual } z_1 - z_2) = .6745\sqrt{2 - r_{1I} - r_{2II}}.$$   (113)


To illustrate this formula consider two tests, say in arith-
metic and spelling, given in a school grade, and assume the
reliability of both tests to be .5. The difference between two
standard scores, say 2.6 and 1.4, for a given pupil may therefore
be written $2.6 - 1.4 = 1.2 \pm .6745$. Since this difference is ap-
proximately twice its probable error the chances are about four
to one that the true difference lies between zero and 2.4.

* This formula was first derived by T. L. Kelley in Journal of Educational Re-
search, September, 1923. A note on his proof is given by the writer in the January,
1925, issue of the same journal.

An observed standard deviation will be larger than the true
standard deviation because of the effect of response errors. This
may be shown by writing

$$x_1 = s + e_1,$$

whence

$$\sigma_1^2 = \sigma_s^2 + \sigma_{e_1}^2.$$

From equation (110) $\sigma_{e_1}^2 = \sigma_1^2 - \sigma_1^2 r_{1I}$, so that

$$\sigma_s = \sigma_{x_1}\sqrt{r_{1I}}.$$
{Relation between true and observed standard errors}   (114)


It is therefore apparent that only for a perfectly reliable test
will the observed and true standard deviations be equal.

Professor Spearman* has given a number of formulas for cor-
recting correlation coefficients for response error, or “attenua-
tion” as he calls it. One of the simplest of these may be worked
out as follows:

The correlation between “true” scores is $r_{st} = \frac{\sum st}{N\sigma_s\sigma_t}$. But
$\sum st = \sum z_1 z_2$, from equation (108), while $\sigma_s = \sqrt{r_{1I}}$ and $\sigma_t = \sqrt{r_{2II}}$.
Therefore,

$$r_{st} = \frac{r_{12}}{\sqrt{r_{1I}\,r_{2II}}},$$
{Spearman's correction for attenuation}   (115)

where $r_{12}$ is the observed correlation.†

As an example of the use of formula (115), if an observed cor-
relation is .6 and the reliability coefficients of $X_1$ and $X_2$ are
both .8, the “true” correlation, with response error eliminated,
will be .75.
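Formula (115) is easily put into computational form. The sketch below reproduces the example just given; the function name is illustrative.

```python
import math


def correct_for_attenuation(r_observed, reliability_1, reliability_2):
    """Formula (115): observed correlation divided by the square root of the
    product of the two reliability coefficients."""
    return r_observed / math.sqrt(reliability_1 * reliability_2)


print(round(correct_for_attenuation(0.6, 0.8, 0.8), 2))
```

Because the divisor never exceeds 1, the corrected value is never smaller in absolute size than the observed correlation. With low reliabilities the formula can even return a value greater than 1, which is one reason for using it with the caution urged in this section.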

If $\sigma$ and $\Sigma$ denote the standard deviations on a test for two
groups, and $r_{1I}$ and $R_{1I}$ the respective reliability coefficients, it
is evident from formula (110) that

$$\sigma_e = \sigma\sqrt{1 - r_{1I}} \quad\text{and}\quad \sigma_E = \Sigma\sqrt{1 - R_{1I}}.$$


* C. Spearman, “Demonstration of Formulæ for True Measurement of Corre-
lation,” American Journal of Psychology, Vol. XVIII (1907), p. 161, and “Correlation
from Faulty Data,” British Journal of Psychology, Vol. III (1910), p. 271.

† For other correction formulas see Yule, Introduction to Statistics, p. 213.
Assuming with Professor Kelley* that $\sigma_e = \sigma_E$, or that the test
is “equally effective” for both groups, we find that

$$\frac{\sigma}{\Sigma} = \frac{\sqrt{1 - R_{1I}}}{\sqrt{1 - r_{1I}}},$$   (116a)

or

$$r_{1I} = \frac{\sigma^2 - \Sigma^2(1 - R_{1I})}{\sigma^2}.$$   (116b)

{Kelley's formula for relating reliability coefficients}


This formula has been used to adjust correlations for differ-
ent ranges as illustrated by the following examples. If the re-
liability of a test is given by $R_{1I} = .5$ for a range with $\Sigma = 5$,
what will the reliability be for a range with standard deviation
of 10? From (116b) we find $r_{1I} = .875$, which shows the effect
of “range of talent” upon the reliability coefficient. It should
be noted, however, that for very small values formula (116)
gives results of doubtful significance. Thus when $\Sigma = 5$, $\sigma = 10$,
and $R_{1I} = .01$, we find $r_{1I} = .75$. That a test which is practi-
cally worthless on one range should be quite reliable on a range
with twice as great variability is contrary to all experience with
such measures.
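Both illustrations can be reproduced from formula (116b). The sketch below writes the formula in the equivalent form 1 − Σ²(1 − R)/σ²; the function and parameter names are illustrative.

```python
def reliability_for_new_range(reliability_known, sd_known, sd_new):
    """Formula (116b): reliability on a range with standard deviation sd_new,
    given the reliability on a range with standard deviation sd_known and
    assuming the same standard error of response on both ranges."""
    return 1.0 - (sd_known ** 2) * (1.0 - reliability_known) / (sd_new ** 2)


print(round(reliability_for_new_range(0.5, 5, 10), 3))
print(round(reliability_for_new_range(0.01, 5, 10), 4))
```

The second case is the doubtful one pointed out above: an almost worthless test on the narrow range comes out with a reliability of about .75 when the range is doubled.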

A general criticism of all the above formulas is that the as- 
sumption of uncorrelated response errors does not appear to be 
justified.t Such negative evidence, however, is not sufficient at 
present to warrant the entire abandonment of the formulas, and 
they are offered here for tentative use until further evidence in 
proof is available. 


EXERCISES 


1. Find the probable errors of the frequencies at I.Q. 80-90 given
in the columns of Table 55. (10.2, 4.1, 12.1, 9.4, 3.7, 19.4. Ans.)


2. Determine the probable errors of the following correlation co- 
efficients: r = .162 (N = 87), r = .088 (N = 640), r = .204 (N = 49), 
r= — .187 (N = 210), r = .083 (N= 40). Use Holzinger’s Table X. 

(.070, .026, .092, .046, .106. Ans.) 


* Kelley, Statistical Method, p. 222. See also Chapter IX of the present text.

† William Brown and Godfrey H. Thomson, in “Essentials of Mental Measure-
ment” (Cambridge University Press, England, 1921), show correlation between
such errors.
3. Test the significance of the differences between the means
and standard deviations given in Table 55. Use Table 54.


4. The following data were obtained from four groups: 


M, = 104 a1 = 10.0 Ni = 110 

Mz = 101 Gy = ile Nz = 97 

M3 = 102 o3 = 9.6 N3 = 92 

Mz = 103 Cin, = Bh N4= 106 
Find the probabilities that M, will be larger than M2, M3, and Ms, 
respectively, on the next sampling. (OSs. o2 mel oe ANS.) 


Hint. Use Table 54. 


5. In a six-month period, 454 deaths from automobiles were re- 
ported in New York and 260 in Chicago. The populations of the 
two cities were 5,600,000 and 2,700,000, respectively. Are ‘‘Gotham’s 
streets safer for the pedestrian than Chicago’s,’’ as reported by a cer- 
tain newspaper? (Difference in death rates is three times its P. E.) 


6. The following data were taken from the President’s Report of 
The University of Chicago, 1923-1924. 


                                          MEN       WOMEN      TOTAL

Graduate schools . . . . . . . . . .     2,083      1,634      3,717
Undergraduate schools  . . . . . . .     4,215      5,425      9,640
Total  . . . . . . . . . . . . . . .     6,298      7,059     13,357


Find the proportion of men in the graduate and in the undergrad-
uate schools, and test the significance of the difference found.
($p_1 - p_2 = .123 \pm .0065$. Ans.)


7. Fit, with a normal curve, the distribution of the Terman scores 
given in Exercise 3 of Chapter II, and apply the x? test. 
(Po 6 AeA ILS) 


8. Apply the x? test to the distributions of I.Q.’s fitted in Exer- 
cise 9 of Chapter XII. 


CHAPTER XIV 


FURTHER METHODS OF CORRELATION FOR 
TWO CHARACTERS 


1. INTRODUCTORY


The correlation methods discussed thus far have been those 
which are applied to quantitative series or to traits which are 
measurable on a numerical scale. In case the series are quali- 
tative or unordered, in the sense used in the second chapter, 
other methods for measuring the association become necessary. 
The present chapter will therefore be concerned with the treat- 
ment of such series by suitable methods. 

In order to illustrate the combinations of series that may arise, 
we may begin by listing some of the possibilities with short sup- 
posititious examples. The table below illustrates the case of an 
association for quantitative and qualitative series, intelligence 
being measured on a numerical scale and school work rated in 
verbal categories in orderly progression. 


TABLE 60. ILLUSTRATING ASSOCIATED QUANTITATIVE 
AND QUALITATIVE SERIES 


ScHooL WorK £-Q: 
80 90 100 110 120 
Good ret ee ere 3 12 "14 sil 8 
B 
Medium ...... 4 15 17 2 é 
=) 
< 
Poona aoa oo 7 3 12 = & 


QUANTITATIVE 


Table 61 on page 257 shows the association between two
qualitative series, both characteristics being verbally indexed.


TABLE 61. ILLUSTRATING ASSOCIATED QUALITATIVE SERIES

SCHOOL WORK                         BEHAVIOR

Bad Troublesome Good Excellent 
Good. pipes na ane 3 9 12 14 a 
a 
Medium 2. ..47 04 4 10 16 2 & 
J 
<q 
Poor 10 2 of — & 

QUALITATIVE 


In both tables there appears to be some association between
the traits, but it cannot be adequately measured by the product-
moment correlation in the form used in Chapter IX, because
of the lack of numerical indexes for the categories.


An example of association for quantitative and unordered 
series is next given in Table 62, the characteristics being the 
intelligence of children and the occupation of their fathers. 


TABLE 62. ILLUSTRATING ASSOCIATED QUANTITATIVE
AND UNORDERED SERIES


OCCUPATION OF FATHER                    I.Q. OF CHILD
                              80      90      100      110      120


Teacher. 7 ill 12 10 

Doctor = =~ 3 g) 14 8 : 

Lawyer. . 3 6 9 12 EB 

Wiriter St. 2. 38 4 Wl ii) 
QUANTITATIVE 


The relationship in this case cannot be observed very readily,
because the arrangement of the occupation categories is a matter
of indifference. A quite different method of measuring associa-
tion will therefore be required for such a problem.
A complete list of the combinations of series which may arise
is given as follows:


a. Quantitative with quantitative
b. Quantitative with qualitative
c. Quantitative with unordered
d. Qualitative with qualitative
e. Qualitative with unordered
f. Unordered with unordered


While some of these occur only rarely in statistical work, it is 
nevertheless desirable to have suitable methods for dealing with 
each type of association. The methods, however, are by no 
means restricted to one type of problem, and consequently the 
choice often becomes a difficult matter. In the present discus- 
sion we shall select a few of the outstanding methods available 
and apply them to problems with suggestions as to the appro- 
priate method to employ whenever possible. 


2. ANOTHER FORMULA FOR THE PRODUCT-MOMENT METHOD 


Before taking up the correlation of qualitative series we shall 
first introduce a modification of the product-moment formula 
convenient for dealing with such data. The method was pre- 
sented by Professor Pearson in one of his lectures at the Uni- 
versity of London. 

Using the notation of Chapter IX, the product-moment 
formula may be written 

$$ r = \frac{\left[\Sigma f_{xy} d_x d_y - \dfrac{(\Sigma f_x d_x)(\Sigma f_y d_y)}{N}\right] hk}{N \sigma_x \sigma_y} \tag{117} $$

where the product $hk$ occurs because the numerator is expressed 
in class intervals. If $\bar{Y}_x$ denotes the mean of a column and $M_y$ 
the mean of the whole table, it is also evident that 

$$ \bar{Y}_x - M_y = \left(\frac{\Sigma f_{xy} d_y}{f_x} - \frac{\Sigma f_y d_y}{N}\right) k. $$

Multiplying both members of this equation by $h d_x$, summing 
over the whole table (each column entering with its frequency $f_x$, 
which is merely a symbol of operation), we have 

$$ \Sigma f_x d_x (\bar{Y}_x - M_y) h = \left[\Sigma f_{xy} d_x d_y - \frac{(\Sigma f_x d_x)(\Sigma f_y d_y)}{N}\right] hk. $$

Substituting this result in formula (117), we then obtain 

$$ r = \frac{\Sigma f_x d_x (\bar{Y}_x - M_y) h}{N \sigma_x \sigma_y} \tag{118a} $$

and, similarly, 

$$ r = \frac{\Sigma f_y d_y (\bar{X}_y - M_x) k}{N \sigma_x \sigma_y} \tag{118b} $$

(Pearson's formulas for the correlation coefficient based on the means of the arrays.) 


The above method is very convenient when the means of the 
arrays are known, for it is then not necessary to calculate the 
quantity $\Sigma f_{xy} d_x d_y$ from the individual cells. It should be noted 
that the variables $d_x$ and $d_y$ may be taken from any origins 
whatsoever, and it may seem a little curious at first that the 
values of formulas (118a) and (118b) remain unchanged when 
the origins are shifted, although all quantities except $d_x$ and $d_y$ are 
fixed throughout in these formulas. 


TABLE 63. ILLUSTRATING THE CALCULATION OF THE CORRELATION 
COEFFICIENT BY FORMULA (118a) 

    X     Mean Y_x   Y_x - M_y   f_x   d_x   (Y_x - M_y) f_x d_x h
  184.5    72.25     + 18.50      1     5        +  925.00
  174.5    52.25     -  1.50      1     4        -   60.00
  164.5    64.75     + 11.00      4     3        + 1320.00
  154.5    60.89     +  7.14     11     2        + 1570.80
  144.5    57.25     +  3.50      9     1        +  315.00
  134.5    52.70     -  1.05     11     0             0.00
  124.5    45.25     -  8.50      5    -1        +  425.00
  114.5    42.25     - 11.50      4    -2        +  920.00
  104.5    37.25     - 16.50      2    -3        +  990.00
   94.5    37.25     - 16.50      1    -4        +  660.00
   84.5    32.25     - 21.50      1    -5        + 1075.00
                                 50                8140.80

M_y = 53.75    sigma_x = 19.92    sigma_y = 10.50    N sigma_x sigma_y = 10,458 

In order to illustrate the application of this method to quan- 
titative data, the correlation problem shown in Table 30 of 
Chapter IX has been worked out in Table 63, using formula 
(118a). The means of the columns $\bar{Y}_x$ were calculated as for any 
distribution, and the values for $d_x$ were taken from the arbitrary 
origin 134.5. A check on the numerator of (118a) may be made 
by shifting to another origin and recalculating the sum of all 
the products. The proof of this check is left as an exercise. 
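
The arithmetic of Table 63 can be retraced with a few lines of Python. This is only a sketch: the names `ROWS`, `numerator_118a`, and `origin_shift` are illustrative and not part of the text, and the figures are simply the rounded entries of the table.

```python
# Rows of Table 63: (mean of column Y_x, frequency f_x, step deviation d_x).
ROWS = [
    (72.25, 1, 5), (52.25, 1, 4), (64.75, 4, 3), (60.89, 11, 2),
    (57.25, 9, 1), (52.70, 11, 0), (45.25, 5, -1), (42.25, 4, -2),
    (37.25, 2, -3), (37.25, 1, -4), (32.25, 1, -5),
]
H = 10.0                         # class interval of X
M_Y = 53.75                      # mean of Y for the whole table
SIGMA_X, SIGMA_Y = 19.92, 10.50  # standard deviations quoted under the table


def numerator_118a(rows, mean_y, h, origin_shift=0):
    """Sum of (Y_x - M_y) * f_x * d_x * h, with d_x taken from a shifted origin."""
    return sum((y_bar - mean_y) * f * (d - origin_shift) * h
               for y_bar, f, d in rows)


n = sum(f for _, f, _ in ROWS)
product_sum = numerator_118a(ROWS, M_Y, H)
r = product_sum / (n * SIGMA_X * SIGMA_Y)

# The check described in the text: move the origin two classes and recompute.
shifted_sum = numerator_118a(ROWS, M_Y, H, origin_shift=2)

print(f"N = {n}, numerator = {product_sum:.2f}, r = {r:.3f}")
print(f"numerator from the shifted origin = {shifted_sum:.2f}")
```

The two numerators agree except for a small discrepancy that comes from the rounding of the column means.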


3. THE PRODUCT-MOMENT METHOD FOR QUALITATIVE SERIES 


A qualitative series may be converted into a quantitative 
one by representing the data on a normal scale as shown in sec- 
tion 7 of Chapter XII. The various groups will then be desig- 
nated by numbers instead of by verbal description, and the 
product-moment method may then be applied for measuring 
the amount of correlation. 

The following table represents the correlation between the 
score on a physics test and the rating of the teachers for 245 
high-school pupils. The combination is, therefore, a quantita- 
tive series with a qualitative one, and the latter will need to be 
converted to a normal scale. 


TABLE 64. DATA FROM A PHYSICS TEST AND TEACHER RATING 

                        TEACHER RATING
TEST SCORE    Poor    Fair    Good    Excellent    TOTAL
70-80           -       -       2         2           4
60-70           1       6      12        18          37
50-60          11      15      24        18          68
40-50          19      26      23        16          84
30-40          10      17       9         4          40
20-30           6       2       1         1          10
10-20           -       1       -         -           1
 0-10           1       -       -         -           1
Total          48      67      71        59         245
Per cent     19.6    27.3    29.0      24.1       100.0

The ordinates bounding the various pieces under the normal 
curve are most readily found by entering a table such as Hol- 
zinger's Table XII with the cumulative frequencies .196, .469, 
and .759, each less .5, or with the values -.304, -.031, and 
.259. The three ordinates resulting are .2766, .3977, and 
.3116, respectively, as illustrated in Fig. 66. 

FIG. 66. Illustrating the means of the four rating categories when the series 
is represented on a normal scale (the means fall at -1.411 sigma, -.444 sigma, 
+.297 sigma, and +1.293 sigma) 

The means of the various pieces may now be worked 
out by formula (86) of Chapter XII; for example, 

$$ \frac{\bar{X}_P}{\sigma_x} = \frac{0 - .2766}{.196} = -1.411 $$

is the mean of the "poor" category. For the 
other three means, we obtain -.444, .297, and 1.293. These 
numbers are to be regarded as class values in the subsequent 
calculations. 
Since $M_x = 0$ for a normal distribution, the required formula 
may be obtained from (118b) in the form 

$$ r = \frac{\Sigma f_y d_y \left(\dfrac{\bar{X}_y}{\sigma_x}\right) k}{N \sigma_y} \tag{119} $$

(Correlation coefficient adapted for use with data on a normal scale.) 

where $\bar{X}_y / \sigma_x$ denotes the mean of a row measured from the mean of 
the table. The values for $\bar{X}_y / \sigma_x$ are obtained by multiplying the 
frequencies in each row by the class values just obtained, and 
dividing by the total in the row. Thus for the top and next row, 

$$ \frac{\bar{X}_{75}}{\sigma_x} = \frac{2 \times .297 + 2 \times 1.293}{4} = +.795, $$

$$ \frac{\bar{X}_{65}}{\sigma_x} = \frac{1(-1.411) + 6(-.444) + 12(.297) + 18(1.293)}{37} = +.615, $$

and, similarly, 

$$ \frac{\bar{X}_{55}}{\sigma_x} = .121, \quad \frac{\bar{X}_{45}}{\sigma_x} = -.129, \quad \frac{\bar{X}_{35}}{\sigma_x} = -.345, \quad \frac{\bar{X}_{25}}{\sigma_x} = -.776, \quad \frac{\bar{X}_{15}}{\sigma_x} = -.444, \quad \text{and} \quad \frac{\bar{X}_{5}}{\sigma_x} = -1.411. $$

An arrangement of the computation for the product sum and 
$\sigma_y$ is shown below. The work is best done with a machine. 


TABLE 65. ILLUSTRATING THE CALCULATION OF THE CORRELATION 
COEFFICIENT FOR THE DATA IN TABLE 64 

 f_y    d_y    f_y d_y    X_y/sigma_x    f_y d_y (X_y/sigma_x)    f_y d_y^2
   4      3       12         .795               9.540                 36
  37      2       74         .615              45.510                148
  68      1       68         .121               8.228                 68
  84      0        0       - .129               0.000                  0
  40     -1     - 40       - .345              13.800                 40
  10     -2     - 20       - .776              15.520                 40
   1     -3     -  3       - .444               1.332                  9
   1     -4     -  4       -1.411               5.644                 16
 245            + 87                           99.574                357

sigma_y = 11.54    N sigma_y = 2827.3 

$$ r = \frac{99.574 \times 10}{2827.3} = .352 $$

By plotting the means of the rows $\bar{X}_y / \sigma_x$ as shown in Fig. 67, a 
graphical representation of the regression is given. It will be 
noted that the points fall fairly closely along a straight line, so 
that the regression is probably to be regarded as linear. The 
equation of the regression line through the mean of the table is 

$$ \frac{x}{\sigma_x} = r \frac{y}{\sigma_y} = \frac{.352}{11.54}\, y = .0305\, y. $$

Since $M_y = 48.55$, two points for plotting are given by substituting 
$y = \pm 30$ in the above equation, or 

$$ \frac{x}{\sigma_x} = +.915 \text{ at } Y = 78.55 \quad \text{and} \quad \frac{x}{\sigma_x} = -.915 \text{ at } Y = 18.55. $$

FIG. 67. Regression line for the physics data 

When both series are qualitative, the above method may be 
applied to the two scales; but, since certain corrections are 
sometimes desirable, another procedure will be shown. In calcu- 
lating the correlation coefficient and other measures of associa- 
tion an error is introduced by grouping the material in broad 
categories. Professor Pearson* has devised several formulas for 
correcting this error, one of which may be written in the form 


$$ {}_c r_{xy} = \frac{N\, \Sigma \left[\dfrac{f_{xy}}{f_x f_y} (z_s - z_{s+1})(z'_{s'} - z'_{s'+1})\right]}{\left[\Sigma \dfrac{N}{f_x} (z_s - z_{s+1})^2\right] \left[\Sigma \dfrac{N}{f_y} (z'_{s'} - z'_{s'+1})^2\right]} \tag{120} $$

(Pearson's corrective formula for broad grouping, assuming normal distributions of the variates.) 

where the z's are ordinates bounding the various pieces under 
the normal curve and the unprimed and primed values refer to 
X and Y, respectively. The use of this formula will next be 
illustrated by a problem which has been taken from Professor 
Pearson's paper cited in the footnote below. 


TABLE 66. PEARSON'S DATA ON INTELLIGENCE AND QUALITY OF CLOTHING 

QUALITY OF                 INTELLIGENCE RATING
CLOTHING          B      C      D      E      F      G    TOTAL
I                33     48    113    209    194     39      636
II               41    100    202    255    138     15      751
III              39     58     70     61     33      4      265
IV and V         17     13     22     10     10      1       73
Total           130    219    407    535    375     59     1725


* Karl Pearson, "On the Measurement of the Influence of Broad Categories 
upon Correlation," Biometrika, Vol. IX, p. 119. 

By using X for intelligence and Y for quality of clothing, the 
ordinates on the two scales may be found in the usual way, and 
the quantities needed for formula (120) may be worked out 
as shown in Table 67, pp. 264 and 265. Holzinger's Table XII 
has been used throughout. The corrected value becomes .315. 
Comparing this result with that obtained in the computation by 
Professor Pearson, it must be noted that his .317 was worked 
out with a somewhat different corrective formula. 

Needless to say, the arithmetic is very laborious and must be 
done on a calculator. The above correction, however, is impor- 
tant, and formula (120) or similar forms given in Pearson’s paper 
should be used for the best results. 

In case only a rough approximation to the correlation is de- 
sired, class values such as 1, 2, 3, ... may be assigned to both 
sets of categories, and the coefficient may be worked out by 
the method of Chapter IX. The student is urged to work out 
this value for the above problem in order to compare results. 


4. THE CORRELATION RATIO FOR QUALITATIVE AND 
UNORDERED SERIES 


When a series has been represented on a normal scale, the cal- 
culation of the correlation ratio becomes very simple. The work 
will be illustrated by the problem of the preceding section. 

Since $M_x = 0$, formula (61) for the correlation ratio based on 
the means of the rows becomes 

$$ \eta_{xy} = \sqrt{\frac{\Sigma f_y \left(\dfrac{\bar{X}_y}{\sigma_x}\right)^2}{N}} \tag{121} $$

(Correlation ratio adapted for use with data on a normal scale.) 

For the data given in Table 64 the arithmetic may be ar- 
ranged as shown in Table 68. The work is very easily done in 
this problem because the means of the rows are already worked 
out in Table 65. The complete calculation is shorter than that 
for the correlation coefficient, since $\sigma_y$ is not required. 

TABLE 68. ILLUSTRATING THE CALCULATION OF THE CORRELATION 
RATIO WITH FORMULA (121) 

 X_y/sigma_x    (X_y/sigma_x)^2    f_y    f_y (X_y/sigma_x)^2
    .795            .6320            4          2.5280
    .615            .3782           37         13.9934
    .121            .0146           68          0.9928
  - .129            .0166           84          1.3944
  - .345            .1190           40          4.7600
  - .776            .6022           10          6.0220
  - .444            .1971            1          0.1971
  -1.411           1.9909            1          1.9909
                                   245         31.8786

$$ \eta_{xy}^2 = \frac{31.8786}{245} = .13012, \qquad \therefore\ \eta_{xy} = \sqrt{.13012} = .361 $$


Applying Blakeman's shorter test for linearity, we find that 

$$ \sqrt{245}\, \sqrt{(.361)^2 - (.352)^2} = 1.25 < 4.05. $$

Since 1.25 is less than one third of 4.05 and N is fairly large, 
the regression in this case may be regarded as sensibly linear. 
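
The correlation ratio of formula (121) and the quantity used in Blakeman's test may be checked by a short sketch. The row means and frequencies are those of Table 65; the variable names are illustrative. Since the squares are not rounded to four places here, the test quantity comes out a trifle below the 1.25 of the hand computation.

```python
from math import sqrt

# Row means in sigma_x units and row frequencies, as listed in Table 65.
ROW_MEANS = [0.795, 0.615, 0.121, -0.129, -0.345, -0.776, -0.444, -1.411]
ROW_FREQS = [4, 37, 68, 84, 40, 10, 1, 1]
R = 0.352  # correlation coefficient found by formula (119)

n = sum(ROW_FREQS)
eta_squared = sum(f * m * m for f, m in zip(ROW_FREQS, ROW_MEANS)) / n
eta = sqrt(eta_squared)

# Blakeman's quantity: sqrt(N) * sqrt(eta^2 - r^2), to be compared with 4.05.
blakeman = sqrt(n) * sqrt(eta_squared - R ** 2)

print(f"eta = {eta:.3f}, Blakeman quantity = {blakeman:.2f}")
```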

If one of the associated series is quantitative or qualitative 
and the other unordered, one of the correlation ratios may al- 
ways be found. Thus, if Y be quantitative, the ratio $\eta_{yx}$ has 
the form 

$$ \eta_{yx} = \frac{\sqrt{\dfrac{\Sigma f_x (\bar{Y}_x - M_y)^2}{N}}}{\sigma_y} \tag{62} $$

(Correlation ratio for means of columns.) 

and is to be regarded as the ratio of two standard deviations, 
both depending upon Y only. The arrangement of the X cate- 
gories is clearly a matter of indifference, since it will not affect 
the numerator or $\sigma_y$ in the above expression. 

An example of a qualitative and an unordered table is fur- 
nished by some data from a study by Mr. Tulchin of the Chi- 
cago Institute for Juvenile Research. A large number of children 
were rated by their teachers as of the "annoying," "sympa- 
thetic," or "unsympathetic" type, and also classified in five intel- 
ligence categories. Inasmuch as the three "attitude" categories 
do not necessarily come in any order, they furnish an unordered 
series. The table of frequencies appears as shown in Table 69. 

TABLE 69. TULCHIN'S DATA ON INTELLIGENCE AND ATTITUDE 

                                  ATTITUDE
INTELLIGENCE          Annoying    Unsympathetic    Sympathetic    TOTAL
5 Very Superior           5            -               219          224
4 Superior               24           12              1213         1249
3 Normal                105          103              2451         2659
2 Inferior              131          108              1021         1260
1 Very Inferior          73           82               174          329
Total                   338          305              5078         5721


Although the method employed with this problem will be the 
same as that for the physics test, the results will be worked out 
for the purpose of further illustration and for comparison of the 
association measured by a later method. The percentage fre- 
quencies of the intelligence distribution are 5.8, 22.0, 46.5, 21.8, 
and 3.9, beginning with the Very Inferior group. The ordinates 
between the pieces by the method of the preceding section are 
therefore .1160, .3354, .3224, and .0844 (Holzinger's Table XII). 
By formula (86), the means of the five pieces under the marginal 
distribution become 

$$ \frac{\bar{y}_1}{\sigma_y} = \frac{0 - .1160}{.058} = -2.000, \qquad \frac{\bar{y}_2}{\sigma_y} = \frac{.1160 - .3354}{.220} = -.997, $$

$$ \frac{\bar{y}_3}{\sigma_y} = +.028, \qquad \frac{\bar{y}_4}{\sigma_y} = +1.092, \qquad \text{and} \qquad \frac{\bar{y}_5}{\sigma_y} = +2.164. $$

Multiplying these class values by the corresponding frequen- 
cies in the columns, the means of the three columns become 

$$ \frac{\bar{y}_A}{\sigma_y} = -.7001, \qquad \frac{\bar{y}_U}{\sigma_y} = -.8383, \qquad \text{and} \qquad \frac{\bar{y}_S}{\sigma_y} = +.0987, $$

the subscripts referring to the verbal categories. The remainder 
of the computation is given in Table 70. 
TABLE 70. ILLUSTRATING THE CALCULATION OF THE CORRELATION 
RATIO FOR TULCHIN'S DATA 

 y_x/sigma_y    (y_x/sigma_y)^2     f_x     f_x (y_x/sigma_y)^2
  + .0987          .009742         5078          49.470
  - .8383          .702747          305         214.338
  - .7001          .490140          338         165.667
                                   5721         429.475

$$ \eta_{yx}^2 = \frac{429.475}{5721} = .07507, \qquad \therefore\ \eta_{yx} = \sqrt{.07507} = .274 $$

There are a number of corrections which may be applied to 
the correlation ratio to adjust for too coarse or too fine grouping. 
The correction for broad categories may be illustrated in the case 
of the data in Table 64 for which $\eta_{yx} = .385$. With ${}_c\eta_{yx}$ denoting 
the corrected ratio and $r_{xc}$ the correlation of x with its class 
value, Professor Pearson* has shown that 


$$ {}_c\eta_{yx} = \frac{\eta_{yx}}{r_{xc}} \tag{122} $$

(Correlation ratio corrected for broad categories.) 

where 

$$ r_{xc} = \sqrt{\Sigma \frac{N}{f_x} (z_s - z_{s+1})^2}. \tag{123} $$

(Correlation of a variable with its class value.) 

The computation will therefore be as follows: 


TABLE 71. ILLUSTRATING THE CALCULATION WITH FORMULA (122), 
FOR THE DATA OF TABLE 64 

 z_s - z_(s+1)            (z_s - z_(s+1))^2    N/f_x     (N/f_x)(z_s - z_(s+1))^2
 z_0 - z_1 = - .2766          .076508          5.1042          .390512
 z_1 - z_2 = - .1211          .014665          3.6567          .053626
 z_2 - z_3 = + .0861          .007413          3.4507          .025580
 z_3 - z_4 = + .3116          .097095          4.1526          .403187
                                                               .872905

$$ r_{xc} = \sqrt{.872905} = .934, \qquad \therefore\ {}_c\eta_{yx} = \frac{.385}{.934} = .412 $$
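
A minimal sketch of formulas (122) and (123) for the same data is given below. The ordinates are those read from the table for the rating scale, with zero ordinates supplied at the two ends of the curve; the function name `class_value_correlation` is illustrative and not taken from the text.

```python
from math import sqrt

# Bounding ordinates z_0 ... z_4 for the four rating categories of Table 64.
ORDINATES = [0.0, 0.2766, 0.3977, 0.3116, 0.0]
FREQS = [48, 67, 71, 59]   # Poor, Fair, Good, Excellent
ETA_YX = 0.385             # uncorrected correlation ratio quoted in the text


def class_value_correlation(ordinates, freqs):
    """Formula (123): correlation of a normal variable with its class value."""
    n = sum(freqs)
    return sqrt(sum((n / f) * (lower - upper) ** 2
                    for f, lower, upper in zip(freqs, ordinates, ordinates[1:])))


r_xc = class_value_correlation(ORDINATES, FREQS)
corrected_eta = ETA_YX / r_xc   # formula (122)

print(f"r_xc = {r_xc:.3f}, corrected eta = {corrected_eta:.3f}")
```

The same function applies to any qualitative series placed on a normal scale, which is why it takes the ordinates and frequencies as arguments rather than fixing them.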


* Karl Pearson, "On the Measurement of the Influence of Broad Categories upon 
Correlation," Biometrika, Vol. IX, p. 116. See, also, Student, "The Correction to be 
made to the Correlation Ratio for Grouping," ibid. p. 316. 

In case there is a fairly large number of categories and N 
is not large, a correction for fineness of grouping may become 
important. This adjustment is especially important in dealing 
with small coefficients even if N be large, as may be illustrated 
by an example in a paper* by the writer. The correlation ratio 
for breathing capacity on reaction time to sight was found to 
be .1404. Mr. R. A. Fisher† has proved that when we sample 
from material for which the actual value of $\eta$ is zero, and $t$ is 
the number of arrays, then the mean value $\bar{\eta}^2$ from sample to 
sample will be $\dfrac{t-1}{N-1}$, where N is the size of the sample. In 
other words, although the true value is zero, the observed 
value will not be zero, owing to the grouping and to the sam- 
pling deviations which must always enter as positive quanti- 
ties. In the present example N = 3373 and t = 17, so that $\bar{\eta}^2$ 
from this formula is .004745, the probable error of which is 

$$ .6745 \sqrt{\frac{2(t-1)(N-t)}{(N-1)^2 (N+1)}}, \quad \text{or } .001128. $$

The difference, $\eta^2 - \bar{\eta}^2$, may 
now be written as .014967 ± .001128, and we may conclude that 
it is extremely unlikely that the ratio found could have arisen 
from the fluctuations in uncorrelated material. 

For breathing capacity on keenness of hearing we find, like- 
wise, $\eta$ = .0840 ± .0115, t = 15, and $\eta^2 - \bar{\eta}^2$ = .002904 ± .001056. 
In this case the observed value would appear to be significant 
by the usual test based on its own probable error; but when 
$\eta^2$ and $\bar{\eta}^2$ are compared, their difference is less than three times 
the probable error of $\bar{\eta}^2$, and hence the observed correlation of 
.0840 may be ascribed to the fluctuations in sampling. Breath- 
ing capacity and keenness of hearing are therefore uncorrelated. 

Corrections for coarseness of grouping may also be made in 
the case of the correlation coefficient. The reader is referred to 
Sheppard's corrections given in Chapter XVI and to a paper 
by Professor Pearson.‡ 


* Karl J. Holzinger, "On the Relation of Vital Capacity to Certain Psychical Char- 
acters," Biometrika, Vol. XVI, p. 145. 

† R. A. Fisher, "The Goodness of Fit of Regression Formulas," Journal of the 
Royal Statistical Society, Vol. 85, p. 597. 

‡ Karl Pearson, "On the Correction Necessary for the Correlation Ratio," Bio- 
metrika, Vol. XIV, p. 412. 
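
The comparison of an observed correlation ratio with its chance value may be sketched as below. Two assumptions should be stated plainly: the expression used for the probable error is the one that reproduces the figures .001128 and .001056 quoted above, and the sample size for the hearing example is taken to be the same N = 3373, which the text does not state. The function names are illustrative.

```python
from math import sqrt


def chance_eta_squared(n, arrays):
    """Mean value of eta squared in uncorrelated material: (t - 1) / (N - 1)."""
    return (arrays - 1) / (n - 1)


def probable_error(n, arrays):
    """Probable error of the chance value (assumed form, see the lead-in)."""
    variance = 2 * (arrays - 1) * (n - arrays) / ((n - 1) ** 2 * (n + 1))
    return 0.6745 * sqrt(variance)


def excess_over_chance(eta, n, arrays):
    """Return eta^2 minus its chance value, and whether it exceeds 3 P.E."""
    excess = eta ** 2 - chance_eta_squared(n, arrays)
    return excess, excess > 3 * probable_error(n, arrays)


sight = excess_over_chance(0.1404, 3373, 17)     # reaction time to sight
hearing = excess_over_chance(0.0840, 3373, 15)   # keenness of hearing

print(f"sight:   excess = {sight[0]:.6f}, significant = {sight[1]}")
print(f"hearing: excess = {hearing[0]:.6f}, significant = {hearing[1]}")
```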
5. BISERIAL r 


If one of the characters in a table such as Table 72 is quantita- 
tive and the other consists merely of two qualitative categories, 
it is possible to find the correlation very simply by a method 
known as biserial r. 

In the derivation of this coefficient it is only necessary to 
assume that the distribution of the twofold (or dichotomous) 
character is normal, and that the regression in the table is 
linear. From Fig. 68, where the usual notation is illustrated, 
it appears at once that the slope of the regression line is given by 

$$ b_{yx} = \frac{\bar{y}_2 - \bar{y}_1}{\bar{x}_2 - \bar{x}_1}. $$

FIG. 68. Illustrating biserial r (Group 1 and Group 2 under the normal curve) 

Making use of the above value, the correlation coefficient may 
now be written 

$$ r = b_{yx} \frac{\sigma_x}{\sigma_y}, \qquad \text{or} \qquad r = \frac{\dfrac{\bar{y}_2 - \bar{y}_1}{\sigma_y}}{\dfrac{\bar{x}_2 - \bar{x}_1}{\sigma_x}}. $$

The numerator of this last expression becomes $(\bar{Y}_2 - \bar{Y}_1)/\sigma_y$; 
that is, the difference between the means of the two columns, 
divided by the standard deviation of Y for the whole table. 
The quantities $\bar{x}_1/\sigma_x$ and $\bar{x}_2/\sigma_x$ are the means of the two pieces 
under the normal curve and are readily found by the use of 
formula (86). Denoting the fractional area $n_1/N$ by $q$ and the 
remaining area $n_2/N$ by $p$, it follows from (86) that 

$$ \frac{\bar{x}_2 - \bar{x}_1}{\sigma_x} = \frac{z}{p} + \frac{z}{q} = \frac{z}{pq}. $$

The desired formula may then be written 

$$ r = \frac{\bar{Y}_2 - \bar{Y}_1}{\sigma_y} \left(\frac{pq}{z}\right). \tag{124} $$

(Biserial r.) 

TABLE 72. RENT AND HEALTH OF YEARLING BABIES ILLUSTRATING 
THE METHOD OF BISERIAL r 

RENT IN SHILLINGS              HEALTH
                      Not Good (1)    Good (2)    TOTAL
[illegible]                 1        [illegible]  [illegible]
[illegible]                 -             4            4
[illegible]                 -             4            4
[illegible]                 1        [illegible]  [illegible]
[illegible]                 1             1            2
[illegible]                 4            45           49
[illegible]                16            82           98
[illegible]                53           252          305
[illegible]               101           303          404
[illegible]               132           182          314
[illegible]                55            64          119
[illegible]                26            18           44
[illegible]                 7             7           14
Total                     397 = n_1     993 = n_2   1390 = N

n_1/N = q = .2856    n_2/N = p = .7144    z = .3399 
Mean Y_1 = 3.7065    Mean Y_2 = 4.1798    sigma_y = .8021 (Sheppard's correction) 
r = .354 
. 


The computation will next be illustrated by Table 72. The 
means $\bar{Y}_1$ and $\bar{Y}_2$ are found to be 3.7065 and 4.1798, re- 
spectively, and $\sigma_y$ = .8021 with Sheppard's correction. Next, 
dividing $n_1$ by N gives q = .2856 and dividing $n_2$ by N gives 
p = .7144. Upon entering Holzinger's Table XII with p - .5 = 
.2144 the value for z is found to be .3399 with linear interpola- 
tion. Substituting all these values in formula (124), we find that 

$$ r = \frac{(.4733)(.2040)}{(.8021)(.3399)} = \frac{.09655}{.2726} = .354. $$

We may therefore conclude that there was some tendency 
for the good health of yearling babies to be associated with 
a relatively high rent for the home. 
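
Formula (124) lends itself to a very short sketch. Here `statistics.NormalDist` is assumed as a substitute for the printed table of ordinates, and the function name `biserial_r` is illustrative; the means, standard deviation, and group sizes are those given for Table 72.

```python
from statistics import NormalDist


def biserial_r(mean_1, mean_2, sigma_y, n_1, n_2):
    """Formula (124): r = (Y2 - Y1) / sigma_y * (p q / z)."""
    n = n_1 + n_2
    q, p = n_1 / n, n_2 / n
    normal = NormalDist()
    # Ordinate of the unit normal curve at the point of dichotomy.
    z = normal.pdf(normal.inv_cdf(p))
    return (mean_2 - mean_1) / sigma_y * (p * q / z)


r = biserial_r(mean_1=3.7065, mean_2=4.1798, sigma_y=0.8021, n_1=397, n_2=993)
print(f"biserial r = {r:.3f}")
```

The ordinate is the same whether the table is entered with p or with q, so the order of the two groups affects only the sign of the result.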
The probable error for biserial r when q is not less than .05 is 
given approximately as 

$$ P.E._{(bis.)} = \frac{.6745 \left(\dfrac{\sqrt{pq}}{z} - r^2\right)}{\sqrt{N}} \tag{125} $$

(Probable error of biserial r.) 


6. THE COEFFICIENT OF CONTINGENCY 


When both characteristics are unordered the above methods 
cannot be used, and we must resort to the theory of probability 
in order to secure a measure of association. To illustrate this 
method, which is known as contingency, we may take a very 
simple correlation table such as the following, the numbers 
being taken small for convenience. 


For the cell marked in heavy lines, we shall have 

$$ f_{xy} = 5, \quad f_x = 10, \quad \text{and} \quad f_y = 7. $$

The probability that a measure will fall in a given column $f_x$ 
is $f_x/N$ (for example, 10/20), since $f_x$ of the N equally likely oc- 
currences are favorable. Similarly, the probability that a 
measure will fall in a particular row is $f_y/N$ (for example, 7/20). 
If now these two events are regarded as independent, the proba- 
bility for their combined occurrence is the product of the two 
probabilities above, or $f_x f_y / N^2$ (for example, 70/400). Out of the 
N measures, therefore, we should expect $N(f_x f_y / N^2)$, or $f_x f_y / N$, to fall 
in a particular cell if the characters are entirely independent. 

For the marked cell the observed frequency is 5 as compared 
with an independence frequency of 70/20, or 3.5. The difference 
5 - 3.5 = 1.5, or in general $f_{xy} - \dfrac{f_x f_y}{N}$, is thus a measure of the 
departure of the two characters from complete independence, 
that is, of contingency. 

Professor Pearson* has defined the mean square contingency 
for the whole table by the relation 

$$ \phi^2 = \frac{\chi^2}{N}, \qquad \chi^2 = \Sigma \left[\frac{\left(f_{xy} - \dfrac{f_x f_y}{N}\right)^2}{\dfrac{f_x f_y}{N}}\right]. \tag{126} $$

(Mean square contingency function.) 


The $\chi^2$ function, it will be noted, is the same as that used in 
Chapter XIII. What is really wanted, however, is a coefficient 
varying between 0 and 1, and this is given by 

$$ C = \sqrt{\frac{\phi^2}{1 + \phi^2}} = \sqrt{\frac{\chi^2}{N + \chi^2}} \tag{127} $$

(Coefficient of contingency.) 

and called by Pearson the coefficient of mean square contingency. 
In the paper cited in the footnote below he shows that when 
both of the characters are normally distributed the limiting value 
for C for many categories is the correlation coefficient r. 

A form of (127) which is more convenient for calculation may 
be obtained by noting that 

$$ \chi^2 = \Sigma \frac{f_{xy}^2}{f_x f_y / N} - 2 \Sigma f_{xy} + \Sigma \frac{f_x f_y}{N} = S' - N, $$

where $S'$ is the squared sum and N results from the remaining 
terms. We may therefore write 

$$ C = \sqrt{\frac{S' - N}{S'}}. \tag{128a} $$

(First computation form for contingency.) 

* Karl Pearson, "On the Theory of Contingency and its Relation to Association 
and Normal Correlation," Draper's Research Memoirs, Biometric Series I, 1904. 

This is the formula recommended by Yule,* but the writer 
prefers the following one, obtained by setting $S' = NS$. 

$$ C = \sqrt{\frac{S - 1}{S}}, \tag{128b} $$

(Second computation form for contingency.) 

where S is now $\Sigma \left(\dfrac{f_{xy}^2}{f_x f_y}\right)$. 

The calculation of C is very simple. If formula (128b) is used, 
the observed cell frequencies $f_{xy}$ are first squared, then the 
products $f_x f_y$ are obtained, and the quotients $f_{xy}^2 / f_x f_y$ worked 
out. The sum of these last quantities gives S, which may then 
be substituted in the formula. For the above problem the 
work may be arranged as follows: 


TABLE 73. SHOWING CALCULATION OF THE CONTINGENCY COEFFICIENT C 

Each cell gives, in order, f_xy, f_xy^2, f_x f_y, and f_xy^2 / (f_x f_y). 

Row      A                     B                     C                    f_y
1        -                     1, 1, 30, .0333       2, 4, 9, .4444         3
2        2, 4, 49, .0816       5, 25, 70, .3571      -                      7
3        2, 4, 49, .0816       4, 16, 70, .2286      1, 1, 21, .0476        7
4        3, 9, 21, .4286       -                     -                      3
f_x      7                     10                    3                     20


We thus find S = 1.7028, whence 

$$ C = \sqrt{\frac{.7028}{1.7028}} = \sqrt{.4127} = .64. $$


* Yule, Introduction to Statistics, p. 65. 
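
The steps of formula (128b) can be sketched in a few lines. The function name `contingency_coefficient` and the list layout of the table are illustrative assumptions; empty cells are entered as zero and contribute nothing to S.

```python
from math import sqrt


def contingency_coefficient(table):
    """Return S = sum of f_xy^2 / (f_x f_y) and C = sqrt((S - 1) / S)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    s = sum(
        f * f / (row_totals[i] * col_totals[j])
        for i, row in enumerate(table)
        for j, f in enumerate(row)
        if f  # empty cells are skipped
    )
    return s, sqrt((s - 1) / s)


# The small table of Table 73: four rows, columns A, B, C.
SMALL_TABLE = [[0, 1, 2], [2, 5, 0], [2, 4, 1], [3, 0, 0]]
s, c = contingency_coefficient(SMALL_TABLE)
print(f"S = {s:.4f}, C = {c:.2f}")
```

The same function may be applied without change to larger tables, such as the attitude and intelligence frequencies of Table 69.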
For further illustration and in order to compare the result 
by this method with that by the correlation ratio, we shall also 
work out the contingency coefficient for the attitude and in- 
telligence ratings given in Table 69. A table of squares is of 
course necessary in all such work. 


CALCULATION OF THE CONTINGENCY COEFFICIENT FOR TULCHIN'S DATA 

Each cell gives, in order, f_xy, f_xy^2, f_x f_y, and f_xy^2 / (f_x f_y). 

INTELLIGENCE    ANNOYING     UNSYMPATHETIC    SYMPATHETIC     f_y

5                      5          -                  219       224
                      25                          47,961
                  75,712                       1,137,472
                 .000330                         .042165

4                     24             12            1213       1249
                     576            144       1,471,369
                 422,162        380,945       6,342,422
                 .001364        .000378         .231989

3                    105            103            2451       2659
                  11,025         10,609       6,007,401
                 898,742        810,995      13,502,402
                 .012267        .013081         .444914

2                    131            108            1021       1260
                  17,161         11,664       1,042,441
                 425,880        384,300       6,398,280
                 .040295        .030351         .162925

1                     73             82             174        329
                   5,329          6,724          30,276
                 111,202        100,345       1,670,662
                 .047922        .067009         .018122

f_x                  338            305            5078       5721


S = 1.113112,  S - 1 = .113112 

$$ C = \sqrt{\frac{.113112}{1.113112}} = .319 $$

In his text on statistics Mr. Yule* has shown that for t 
categories each way the contingency coefficient has a maximum 
value of $\sqrt{\dfrac{t-1}{t}}$ and that for such a table the largest value for 
C is given as follows: 


* G. Yule, Introduction to Statistics, p. 66. 

If t = 2, C cannot exceed .707. 
If t = 3, C cannot exceed .816. 
If t = 4, C cannot exceed .866. 
If t = 5, C cannot exceed .894. 
If t = 6, C cannot exceed .913. 
If t = 7, C cannot exceed .926. 
If t = 8, C cannot exceed .935. 
If t = 9, C cannot exceed .943. 
If t = 10, C cannot exceed .949. 


It is well therefore to restrict the use of the coefficient of con- 
tingency to 5 x 5 fold or finer classification whenever possible. 
For low association values, however, the above difficulty does 
not enter in any marked degree and the contingency method 
is always valuable in making a preliminary analysis of a table 
as illustrated by Professor Pearson in his Tables.* For the 
example on page 276 the method of contingency is as good 
as the correlation ratio, and the two results found are in fairly 
close agreement. 

The correction† for broad grouping in the case of the con- 
tingency coefficient becomes 

$$ {}_cC = \frac{C}{r_{xc}\, r_{yc}} \tag{129} $$

(Correction to the contingency coefficient for broad grouping.) 

where $r_{xc}$ and $r_{yc}$ are given by formula (123). For the problem 
in Table 66 Professor Pearson finds C = .291. The values for 
$r_{xc}$ and $r_{yc}$ may be easily obtained from the work in Table 67, 
that is, 

$$ r_{xc} = \sqrt{.9319} = .965 \quad \text{and} \quad r_{yc} = \sqrt{.8267} = .909. $$

Substituting these results in formula (129), we find that 

$$ {}_cC = \frac{.291}{(.965)(.909)} = .332. $$

This is again in close agreement with Pearson's result, ${}_cC$ = .334, 
worked out with another corrective formula (loc. cit. p. 181). 


* Pearson's Tables, p. xxxv. 
† Biometrika, Vol. IX, p. 180. 

Professor Pearson concludes his paper with the remark that 
for contingency tables of 5 x 5 or 6 x 6 the corrective factors 
will be small, but for 4 x 4 or 3 x 3 tables the corrections are 
important and should always be made. 

The probable error of C is rather awkward to work out. It 
is given by the formula 


2 ae 
6745 Es ae! | Je goat error “ab 
P.£.. =——=| —-—— a 2 | 470! contingency 
VN (1 + 7° coefficient 
(n- 4 
zy N 


1 
where o? == 2 | ————_ |= S-1 
N 7 


(246) 


ee 


It is therefore necessary to work out W® by entering each cell. 


and pear 


7. CORRELATION FROM RANKS 


When the data are ranked in order of magnitude a rough 
measure of the correlation is given by Spearman's formula, 

$$ \rho = 1 - \frac{6 \Sigma (v_x - v_y)^2}{N(N^2 - 1)}, \tag{131} $$

(Spearman's formula based on rank differences.) 

where $v_x$ and $v_y$ are the ranks of the X and Y items, 
respectively. 

The above formula may be readily obtained from the product- 
moment formula by setting $X = v_x$ and $Y = v_y$. By noting 
that the sum of the squares of the first N integers is given 
by N(2N+1)(N+1)/6, the remainder of the proof may be 
worked out by forming $\Sigma xy$, $\sigma_x$, and $\sigma_y$ and is left as an exer- 
cise for the student. It may also be shown that $\rho$ ranges in 
value from -1 to 1. 

The calculation of $\rho$ is very simple as shown in the following 
example, which is limited to 10 cases for illustration. After 
ranking the items in the two series by the method of Chapter 
II, the computation may be arranged as shown in Table 74. 


TABLE 74. ILLUSTRATING THE CALCULATION OF CORRELATION FROM RANKS 

      X            Y         v_x    v_y    (v_x - v_y)    (v_x - v_y)^2
 [illegible]      117          2      6       - 4             16
     169          153          3    1.5         1.5            2.25
     128          131          7      4         3              9
     141          105          5      7       - 2              4
     106      [illegible]      9     10       - 1              1
     146          130          4      5       - 1              1
      87           80         10      9         1              1
     114          101          8      8         0              0
     187          153          1    1.5       - 0.5            0.25
     133          132          6      3         3              9
                                                              43.5

$$ \rho = 1 - \frac{6 \times 43.5}{10 \times 99} = 1 - \frac{261}{990} = .74 $$
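
The rank computation of Table 74 may be verified by the sketch below. The ranks are copied from the table, tied items carrying the average rank 1.5; the function name `spearman_rho` is illustrative.

```python
# Ranks of the ten cases of Table 74 (ties receive the average rank).
RANK_X = [2, 3, 7, 5, 9, 4, 10, 8, 1, 6]
RANK_Y = [6, 1.5, 4, 7, 10, 5, 9, 8, 1.5, 3]


def spearman_rho(ranks_x, ranks_y):
    """Formula (131): rho = 1 - 6 * sum(d^2) / (N (N^2 - 1))."""
    n = len(ranks_x)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(ranks_x, ranks_y))
    return 1 - 6 * sum_d2 / (n * (n * n - 1))


sum_d2 = sum((a - b) ** 2 for a, b in zip(RANK_X, RANK_Y))
rho = spearman_rho(RANK_X, RANK_Y)
print(f"sum of squared rank differences = {sum_d2}, rho = {rho:.2f}")
```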


One difficulty in the use of the above formula arises from the 
fact that a rectilinear form of distribution is assumed, that is, 
one frequency for each rank. In order to overcome this diffi- 
culty Professor Pearson* has given a corrective formula, 

$$ r = 2 \sin\left(\frac{\pi}{6} \rho\right), \tag{132} $$

(Pearson's correction to Spearman's rank coefficient.) 

which converts $\rho$ into r under the assumption of a normal dis- 
tribution. This correction, however, is small, amounting to .018 
at most, and is usually not important because lack of normality 
may introduce an error several times as large as the correction. 
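
The size of the correction in formula (132) can be examined numerically. The sketch below simply evaluates the formula on a grid of values of rho; the function name `pearson_corrected` and the grid spacing of .001 are illustrative choices.

```python
from math import pi, sin


def pearson_corrected(rho):
    """Formula (132): r = 2 sin(pi * rho / 6)."""
    return 2 * sin(pi * rho / 6)


# Largest amount by which the corrected value exceeds rho between 0 and 1.
largest = max(pearson_corrected(k / 1000) - k / 1000 for k in range(1001))

print(f"rho = .74 gives r = {pearson_corrected(0.74):.3f}")
print(f"largest correction = {largest:.3f}")
```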

The student is urged to make up a short example in which the 
distributions are very skewed. The correlation coefficient and 
rank coefficient should be computed and the difference noted. 


* Karl Pearson, Mathematical Contributions to Evolution, XVI, p. 12. Cam- 
bridge University Press, London. 

Another objection to $\rho$ appears when there are a good many 
ties in rank. This difficulty is illustrated by the following series: 


| 


If the value for p be worked out it becomes .50 instead of zero 
as found by the product-moment method. The above example 
is, of course, extreme, but a large proportion of ties in rank will 
generally be found to produce a correspondingly large error. 

When the data are necessarily given in the form of ranks, and 
when there are not many ties in rank (say less than one fifth of 
the items), Spearman’s rank formula may be conveniently used 
to give a rough indication of the correlation. While the arith- 
metic is simple for short series, the ranking and squaring become 
laborious* beyond 50 cases. The method is, therefore, recom- 
mended for about 20 to 40 cases. With more data the product- 
moment method is theoretically better and more rapid. 

The probable error of $\rho$ is given by the formula 

$$ P.E._{\rho} = \frac{.7063 (1 - \rho^2)}{\sqrt{N}}, \tag{133} $$

(Probable error of Spearman's rank coefficient.) 

from which it appears that $\rho$ is a more unreliable measure of 
correlation than r. 


EXERCISES 


1. Work out the product-moment correlation coefficients for the 
problems of Exercise 1, Chapter IX, using the method illustrated in 
section 2 of the present chapter. Assuming that the means of the 
arrays are also needed, compare the total amount of arithmetic with 
that required by the use of the correlation form. 


* A useful table for the calculation of $\rho$ is given in Tables for the Rank Differ- 
ence Method. The Scott Company Laboratory, Philadelphia, 1920. 

2. Work out the contingency coefficient for the following problem: 


CORRELATION BETWEEN OCCUPATIONAL STATUS OF PARENT AND 
NATIVITY OF CHILD 

                        NATIVITY
OCCUPATION      1      2      3      4      5    TOTAL
A               -      8      8     19     15      50
B               4     28     15     51     17     115
C              18    133     26     65     43     285
D              11     73     19     35     12     150
Total          33    242     68    170     87     600


Key: A = Professional Class 
     B = Merchant 
     C = Skilled Labor 
     D = Unskilled Labor 

     1 = Child born in United States 
     2 = One parent born in United States 
     3 = Both parents born in United States 
     4 = One grandparent born in United States 
     5 = Both grandparents born in United States 

(C = .294. Ans.) 


3. Compute the coefficient of contingency for the following table: 


CORRELATION BETWEEN NATIVITY AND MENTAL LEVEL OF CHILD 


MENTAL CATEGORY | 
NATIVITY TOTAL 
B D N Ss VS 
4 5 10 4 19 
3 3 10 12 3 28 
2 5 24 52 17 5 107 
il 5 9 31 5 52 
Total 10 36 98 44 1 206 
Key: F= Feeble-minded N = Normal 
B = Border-line S = Superior 
D= Dull VS = Very Superior 


(Gi= 743845 “Ans') 


Note. The data for Exercises 2 and 3 were furnished by Mrs. Irene Lange. 
4. Work out the correlation from ranks for the Otis and Terman 
test scores of Exercise 1, Chapter II. (i= SOUS, AWS.) 

Compare the amount of arithmetic with that in the product- 
moment method. 


5. Do the exercise suggested at the bottom of page 278. 
6. Compute 7, for the following table. 


                 NUTRITION
HEALTH        C      B      A    TOTAL
V.R.          4     11      5      20
R.           56     56     12     124
N.           50     49     16     115
R.D.        140    168     37     345
D.           94     89     16     199
V.D.          6      5      1      12
Total       350    378     87     815


(Ney = 096. Ans.) 


7. Work out C for the data of Exercise 6. (C = .119. Ans.) 


8. Compute r for the data of Exercise 6, using formula (120). 
(r = .0768. Ans.) 


CHAPTER XV 
PARTIAL AND MULTIPLE CORRELATION 
1. THE MEANING OF PARTIAL CORRELATION 


In dealing with correlation thus far the relationship of only 
the two associated characters has been considered. Each of 
these, however, is dependent upon many other factors which 
may influence the observed correlation to a considerable extent. 
The problem of partial correlation is to find the relationship 
between two variables when the influence of other variables 
has been eliminated or when such factors have been held 
constant. 

The conditioning factors may be eliminated by experimental 
procedure or by the use of a formula as illustrated by the fol- 
lowing example. The factors considered are mental age, chron- 
ological age, and ossification ratio, the latter being an index of 
anatomical development based on measurements of the wrist 
bones. The problem is to discover the relationship between 
mental and physical development when the influence of age 
has been eliminated. 

Data for the experimental solution of this problem were fur- 
nished by records of the Laboratory Schools of The University 
of Chicago, the work being done by Miss Ethel Abernethy* 
and others. The children were all measured within a few days 
of each birthday. In the table given on page 284 it will be noted 
that not one of these coefficients is significant in comparison 
with its probable error. We may therefore conclude that for 
children of the same age, carpal development and mental age 
are entirely unrelated. 

* Ethel M. Abernethy, “Correlation in Physical and Mental Growth,” Journal 
of Educational Psychology, October and November, 1925. 
TABLE 75. CORRELATION OF MENTAL AGE AND OSSIFICATION RATIO (GIRLS) 


CHRONOLOGICAL AGE    NUMBER OF CASES    CORRELATION COEFFICIENT
      6-12                 120               + .016 ± .062
       13                   44               − .137 ± .100
       14                   62               − .139 ± .084
       15                   29               − .174 ± .122
       16                   45               − .022 ± .101
       17                   37               + .041 ± .111


Turning now to the method of partial correlation, we may 
designate the three variables as follows: 


1 = ossification ratio, 
2 = mental age, 
3 = chronological age. 


The correlation between 1 and 2 for 3 fixed is required and is 
given by the formula* 

$$r_{12.3} = \frac{r_{12} - r_{13} r_{23}}{\sqrt{(1 - r_{13}^2)(1 - r_{23}^2)}} \qquad (134)$$
{Partial correlation coefficient for three variables}

By taking several hundred cases ranging in age from 5 to 
20 years, the three necessary correlations were found to be 
$r_{12} = .75$, $r_{13} = .87$, and $r_{23} = .83$ (girls). When these values 
are substituted in the above formula we find that 

$$r_{12.3} = \frac{.75 - (.87)(.83)}{\sqrt{[1 - (.87)^2][1 - (.83)^2]}} = .101.$$

For 320 cases the probable error of this result is .037, and for 
360 boys we find, similarly, $r_{12.3} = .089 \pm .035$. Neither of these 
coefficients is three times its probable error, so they are to be 
regarded as insignificant. The above method then gives results 
in entire agreement with the experimental procedure of 
Miss Abernethy. It should be noted that the original correlation 
of .75 between mental age and ossification ratio is thus 
due entirely to the correlation of each of these variables with 
chronological age. 

* When a calculating machine is available Miner’s Tables for $\sqrt{1 - r^2}$ (Johns 
Hopkins Press) are most convenient. For logarithmic calculation using Holzinger’s 
Tables, VII, see section 2. 
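The substitution just described may be sketched in a few lines. The function names are illustrative only, and the probable error is taken in the ordinary form .6745(1 − r²)/√N, which is assumed here to apply to the partial coefficient as it does to a zero-order one.

```python
import math

def partial_r(r12, r13, r23):
    """First-order partial correlation of variables 1 and 2 with 3 held constant."""
    return (r12 - r13 * r23) / math.sqrt((1 - r13 ** 2) * (1 - r23 ** 2))

def probable_error(r, n):
    """Assumed form: P.E. = .6745 (1 - r^2) / sqrt(N)."""
    return 0.6745 * (1 - r ** 2) / math.sqrt(n)

# Girls: 1 = ossification ratio, 2 = mental age, 3 = chronological age
r12_3 = partial_r(0.75, 0.87, 0.83)
pe = probable_error(r12_3, 320)

# Rule used in the text: a coefficient is significant only if it is
# at least three times its probable error.
significant = abs(r12_3) >= 3 * pe
print(round(r12_3, 3), round(pe, 3), significant)
```

The partial coefficient is small because the product of the two correlations with age (.87 × .83 = .722) accounts for nearly all of the observed .75.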

In Chapter IX, section 8, it was shown that some selection 
lowers the correlation between traits. Thus if a narrow age 
range were used we should expect altered correlations between 
ossification ratio and age and between mental and chronologi- 
cal age, with a resulting lower correlation between the physical 
and mental traits. By restricting the range of age to zero, we 
reach rigorous selection the effect of which has been noted 
above. Partial correlation, then, may be regarded as a method 
for obtaining relationships under rigorous selection of certain 
conditioning variables. 

While it is usually best to isolate factors experimentally it 
is often not advisable to do so because of the great reduction 
in the number of cases. The chief factors to be controlled in 
the above laboratory data are age, sex, and race. If all these 
are eliminated by selecting the cases, groups of 8 to 15 result, 
and correlations based on such small numbers are almost 
worthless. The method of partial correlation makes it possible 
to use a much larger body of data, eliminating the conditioning 
factors by means of formulas. It is therefore a very useful and 
powerful tool in analyzing the relationships in a set of corre- 
lated variables. 

Partial correlations may be worked out for any number of 
variables, but the arithmetic beyond four variables becomes 
very lengthy and tedious. Of the various methods of computa- 
tion, solution by logarithms is probably best for students who do 
not have the use of a calculating machine. In the next section 
we shall therefore give examples of three-variable and four- 
variable correlations, using logarithms and straight arithmeti- 
cal substitution. 

One important caution to be observed at the outset is to use 
the method of partial correlation only in case the tables from 
which the original coefficients are obtained are sensibly linear. 
The procedure is then to find the product-moment correlations 
for all the variables studied, test the tables for linearity by 
the method of Chapter X, and then substitute the coefficients 
obtained in suitable formulas, provided the regressions are all 
sufficiently linear. In case non-linear relationships are found 
other methods must be employed, such as the procedure de- 
scribed in the last section of Chapter X. 


2. PARTIAL CORRELATION FOR THREE AND FOUR VARIABLES 


In dealing with several variables it becomes necessary to use 
a suitable notation for the various coefficients which arise. If 
the variables are designated as $X_1, X_2, X_3, \ldots, X_n$, the original 
correlations $r_{12}, r_{13}, \ldots, r_{23}, r_{24}, \ldots, r_{(n-1)n}$ are known as 
coefficients of zero-order, and the subscripts are called primary 
subscripts. 

Correlations such as $r_{12.3}$, $r_{23.1}$, and $r_{12.n}$ are regarded as 
coefficients of the first-order, while the correlations $r_{12.34}$, $r_{23.14}$, 
and $r_{34.12}$ are said to be coefficients of the second-order, and so 
on. The subscripts following the decimal point are known as 
secondary subscripts. 

The general formula for the partial correlation of the order 
$(n - 2)$ for $n$ variables is given by 

$$r_{12.34\ldots n} = \frac{r_{12.34\ldots(n-1)} - r_{1n.34\ldots(n-1)}\, r_{2n.34\ldots(n-1)}}{\sqrt{\left[1 - r_{1n.34\ldots(n-1)}^2\right]\left[1 - r_{2n.34\ldots(n-1)}^2\right]}} \qquad (135)$$
{Partial correlation coefficient of the order (n − 2)}

This gives the correlation between variables $X_1$ and $X_2$ when 
the remaining $n - 2$ variables have been held constant. 

Yule* has shown that the order of the secondary subscripts 
is indifferent, so that $r_{12.34} = r_{12.43}$, and $r_{12.345} = r_{12.354} = r_{12.543}$, 
etc. These alternative formulas, as we shall see, furnish very 
useful checks on the arithmetic, since they give independent 
solutions for the various partial coefficients. 


* Yule, Introduction to Statistics, chap. xii. Charles Griffin & Co., London, 1924. 
Using formula (135), we may now write down all the possible 
correlations from three variables. These are evidently 

$$r_{12.3} = \frac{r_{12} - r_{13} r_{23}}{\sqrt{[1 - r_{13}^2][1 - r_{23}^2]}} \qquad (136a)$$

$$r_{13.2} = \frac{r_{13} - r_{12} r_{23}}{\sqrt{[1 - r_{12}^2][1 - r_{23}^2]}} \qquad (136b)$$

and

$$r_{23.1} = \frac{r_{23} - r_{12} r_{13}}{\sqrt{[1 - r_{12}^2][1 - r_{13}^2]}} \qquad (136c)$$
{Partial correlations of first-order}

Similarly, in the case of four variables we shall have 

$$r_{12.34} = \frac{r_{12.3} - r_{14.3} r_{24.3}}{\sqrt{[1 - r_{14.3}^2][1 - r_{24.3}^2]}} \qquad (137a)$$

$$r_{12.43} = \frac{r_{12.4} - r_{13.4} r_{23.4}}{\sqrt{[1 - r_{13.4}^2][1 - r_{23.4}^2]}} \qquad (137b)$$

$$r_{13.24} = \frac{r_{13.2} - r_{14.2} r_{34.2}}{\sqrt{[1 - r_{14.2}^2][1 - r_{34.2}^2]}} \qquad (137c)$$

and

$$r_{13.42} = \frac{r_{13.4} - r_{12.4} r_{32.4}}{\sqrt{[1 - r_{12.4}^2][1 - r_{32.4}^2]}} \qquad (137d)$$

etc. {Partial correlations of second-order}


Since the two primary subscripts may be selected from four in 
${}_4C_2 = 6$ ways, there are evidently six possible partial correlations 
of the second-order with four variables. Each of these six may 
be obtained in two ways as a check, for example, $r_{12.34} = r_{12.43}$. 
The total number of arrangements of the subscripts for four 
variables is therefore twelve. The student should write all these 
out in full in order to become familiar with the formula and 
with the notation employed. 
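The twelve arrangements can also be listed mechanically as a check on the written list. The sketch below is only an illustration of the counting argument; the function name and the label format are hypothetical.

```python
from itertools import combinations, permutations

variables = [1, 2, 3, 4]

def second_order_labels(variables):
    """List every arrangement of subscripts r_ij.kl for four variables."""
    labels = []
    # Six choices of the two primary subscripts ...
    for primary in combinations(variables, 2):
        rest = [v for v in variables if v not in primary]
        # ... each with two orders of the secondary subscripts.
        for secondary in permutations(rest):
            labels.append("r%d%d.%d%d" % (primary + secondary))
    return labels

labels = second_order_labels(variables)
for label in labels:
    print(label)
```

Labels sharing the same primary subscripts, such as r12.34 and r12.43, denote the same coefficient reached by two different routes.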

As an illustrative problem we shall take some results found 
by Mr. Cyril Burt. The variables considered may be defined 
as shown in the list on the following page. 
$X_1$ = mental age on an English revision of the Binet scale, 

$X_2$ = school attainment expressed in educational age, 

$X_3$ = intellectual development as measured in age units by 
Burt’s reasoning test, 

$X_4$ = chronological age. 


The observed correlations of zero-order may be arranged as 
follows: 

TABLE 76. BURT’S* INTERCORRELATIONS 

        X1     X2     X3
X1
X2     .91
X3     .84    .75
X4     .83    .87    .70


Burt does not give the tables on which these correlations are 
based, but we shall assume they are linear and proceed to the 
calculation of the partial coefficients. 

The total number of different correlations of first-order is 
evidently twelve, since two variables may be selected from four 
in six ways, each pair furnishing two correlations on account of 
the interchangeability of the secondary subscripts. 

In working out these values it is best to arrange the calculation 
as in Table 77 so as to identify each step in the computation. 
In the following work the logarithms of $\sqrt{1 - r^2}$ were 
taken from Holzinger’s Table VII and rounded off to four places. 
Products such as .84 × .83 = .6972 have also been rounded off 
to three figures (.697), and four-place logarithms used for the 
remainder of the computation. Greater accuracy than this is 
unnecessary when the original coefficients are correct to only 
two places (see Chapter V). 

The first item in column (2) is obtained by forming the 
product .84 × .75 = .630, that is, the product of the coefficients 
in the first group of three not in line with .91; next, .910 − .630 
= .280 gives the first entry in column (3); the logarithm of .280 is 
9.4472 − 10 as shown in column (4). This completes the calculation 
up to the logarithm of the numerator. The logarithm of the 
denominator of $r_{12.3}$ is now obtained by adding the logarithms 
(from Holzinger’s Table VII) of $\sqrt{1 - r_{13}^2}$ and $\sqrt{1 - r_{23}^2}$, which 
are listed in column (1). The first entry in column (5) is then 
found by adding 9.7345 − 10 and 9.8205 − 10, giving 9.5550 − 10. 
To complete the calculation for $r_{12.3}$ it is only necessary to subtract 
the logarithm of the denominator from the logarithm of 
the numerator (9.8922 − 10 in column (6)), and look up the 
corresponding number, or anti-logarithm (.780 in column (7)). The 
remaining correlations are calculated in a similar way. 

* Cyril Burt, Mental and Scholastic Tests, p. 182. King and Son, Ltd., London, 1921. 
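The column-by-column scheme for $r_{12.3}$ may be retraced as follows. This is a sketch only: the logarithms are computed rather than read from Holzinger’s Table VII, the helper `table_log` is a hypothetical name, and each step is rounded as a four-place table would round it.

```python
import math

# Zero-order correlations from Table 76
r12, r13, r23 = 0.91, 0.84, 0.75

def table_log(x, places=4):
    """Common logarithm of x (0.1 <= x < 1) written as '9.xxxx', with '- 10' understood."""
    return round(math.log10(x) + 10, places)

col2 = round(r13 * r23, 3)                  # product .84 x .75
col3 = round(r12 - col2, 3)                 # numerator .910 - .630
col4 = table_log(col3)                      # log of numerator

log_a = table_log(math.sqrt(1 - r13 ** 2))  # log sqrt(1 - r13^2)
log_b = table_log(math.sqrt(1 - r23 ** 2))  # log sqrt(1 - r23^2)
col5 = round(log_a + log_b - 10, 4)         # log of denominator

col6 = round(col4 - col5 + 10, 4)           # log of the coefficient
col7 = round(10 ** (col6 - 10), 3)          # anti-logarithm

print(col2, col3, col4, col5, col6, col7)
```

Subtracting logarithms replaces the one division in the formula, which is the whole economy of the method when no calculating machine is at hand.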

In finding the coefficients of second-order the first-order 
values just found may again be arranged in convenient groups 
of three, and the same scheme of calculation carried out, as 
illustrated in Table 78. A complete check on the arithmetic 
is given by two solutions for each second-order coefficient with 
formulas such as (137). Each of the six second-order values is 
thus worked out twice, as shown in the table on page 291. 

With zero-order coefficients correct to only two places no 
greater accuracy can be expected in the higher-order coefficients, 
but three-place values have been used in Table 78, so that the 
final results may be rounded off to two places. 
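The double solution for a second-order coefficient may be sketched by applying the reduction formula repeatedly to Burt’s zero-order values. The function `partial` and its argument order are illustrative assumptions; full machine precision is carried instead of the rounded table entries, so the third decimal may differ slightly from a hand computation.

```python
import math

# Zero-order correlations from Table 76, keyed by (smaller, larger) subscript
r = {(1, 2): 0.91, (1, 3): 0.84, (1, 4): 0.83,
     (2, 3): 0.75, (2, 4): 0.87, (3, 4): 0.70}

def zero(i, j):
    """Zero-order correlation between variables i and j."""
    return r[(min(i, j), max(i, j))]

def partial(i, j, held):
    """r_ij.held, eliminating the last variable in `held` at each step."""
    if not held:
        return zero(i, j)
    *rest, k = held
    r_ij = partial(i, j, rest)
    r_ik = partial(i, k, rest)
    r_jk = partial(j, k, rest)
    return (r_ij - r_ik * r_jk) / math.sqrt((1 - r_ik ** 2) * (1 - r_jk ** 2))

r12_34 = partial(1, 2, [3, 4])   # built from r12.3, r14.3, r24.3
r12_43 = partial(1, 2, [4, 3])   # built from r12.4, r13.4, r23.4
r23_14 = partial(2, 3, [1, 4])

print(round(r12_34, 3), round(r12_43, 3), round(r23_14, 3))
```

Agreement of the two routes to $r_{12.34}$ is the check on the arithmetic; a discrepancy beyond rounding would point to an error in one of the first-order values.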

The interpretation of coefficients such as those found is 
rendered difficult because of the fact that the first three variables 
are all measures of the same thing to a certain extent, 
and holding one or more of them constant gives a result of 
doubtful meaning. The variables $X_1$ and $X_3$ are both measures 
of intelligence, but $r_{12.34} = +.61$ while $r_{23.14} = -.08$, the latter 
coefficient being negligible. Burt interprets the coefficient .61 
as follows: 

“With both age and ‘intelligence’ (reasoning ability) constant, 
the partial correlation between school attainments and 
Binet results remains at .61. . . . There can, therefore, be little 
doubt that with the Binet-Simon scale a child’s mental age is 
a measure not only of the amount of intelligence with which 
he is congenitally endowed . . . it is also an index, largely if not 
mainly, of the mass of scholastic information and skill . . . which 
he has accumulated in school.” (Op. cit. p. 182) 

The coefficient $r_{23.14} = -.08$, on the other hand, would seem 
to show that for children of given chronological and mental 
ages, reasoning ability (or “intelligence” as Burt calls it) is 
entirely unrelated to scholastic achievement. Burt and others 
have claimed that this result shows the reasoning test to be a 
pure measure of intelligence “independent of schooling.” If 
mental age, however, is a measure of both “intelligence” and 
achievement, the partial correlation above will necessarily be 
low because, by fixing $X_1$, the variables $X_2$ and $X_3$ are thereby 
both restricted. It may also be noted that “schooling” as used 
here is a measure of relative achievement in school. The fact 
that the Binet test has higher correlation with such achievement 
than does Burt’s test indicates that the former is the 
better guide in predicting scholastic success and is therefore a 
better intelligence test for practical purposes. 


3. PARTIAL REGRESSION EQUATIONS FOR THREE VARIABLES 


When two variables, $X_1$ and $X_2$, are involved it has been 
shown in Chapter IX that the equation for predicting the most 
probable value of $X_1$ for a given value of $X_2$ is given by the 
regression equation 

$$\bar{X}_1 = r_{12}\frac{\sigma_1}{\sigma_2}X_2 + \text{constant} = b_{12}X_2 + \text{constant}. \qquad (138)$$
{Regression equation for two variables}

This same method of prediction will now be applied to several 
variables, $X_1, X_2, X_3, \ldots, X_n$. The regression equation for 
estimating $X_1$ from the remaining $n - 1$ variables is 

$$\bar{X}_1 = b_{12.34\ldots n}X_2 + b_{13.24\ldots n}X_3 + \cdots + b_{1n.23\ldots(n-1)}X_n + C, \qquad (139)$$

which is known as a linear function of the X’s. The quantities 
$b_{12.34\ldots n}$, $b_{13.24\ldots n}$, ..., $b_{1n.23\ldots(n-1)}$ and $C$ are constants to be 
so chosen that the sum of the squared differences between the observed 
values $X_1$ and the predicted values $\bar{X}_1$ shall be as small as 
possible; that is, so that $\sum (X_1 - \bar{X}_1)^2$ = a minimum. These 
differences, $X_1 - \bar{X}_1$, are also known as errors of estimate or residuals. 

By applying the method of least squares in a manner similar 
to that in Chapter IX, it may be shown that 


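$$b_{12.34\ldots n} = r_{12.34\ldots n}\,\frac{\sigma_{1.34\ldots n}}{\sigma_{2.34\ldots n}} \qquad (140)$$
{Partial regression coefficient of the order (n − 2)}

where 

$$\sigma_{1.23\ldots n} = \sigma_1\sqrt{(1 - r_{12}^2)(1 - r_{13.2}^2)\cdots(1 - r_{1n.23\ldots(n-1)}^2)} \qquad (141)$$
{Standard deviation of the order (n − 1)}

The probable error of estimate is of course given by 

$$\text{P.E.}_{est} = .6745\,\sigma_{1.23\ldots n}. \qquad (142)$$
{Probable error of estimate}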

The value furnished by (140) is known as a partial regression 
coefficient. It gives the average change in the dependent vari- 
able (left member of the equation) for a unit change in the 
variable to which it is attached, when all the remaining vari- 
ables are kept constant. 

It will be noted that the subscripts are so arranged that the 
position of any regression coefficient is uniquely determined. 
Thus for $b_{13.24\ldots n}$ the primary subscript 1 indicates that $X_1$ is 
the dependent variable, while 3 shows the variable X3 to which 
the coefficient is attached. The remaining secondary subscripts 
following the decimal point merely show the number of variables 
involved, and their arrangement is a matter of indifference. 

In order to illustrate the above formulas we shall next write 
out in full the equations for three variables. From equations 
(140) and (141) we find that 


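$$\begin{aligned}
\bar{X}_1 &= b_{12.3}X_2 + b_{13.2}X_3 + C_1 \\
&= r_{12.3}\frac{\sigma_{1.3}}{\sigma_{2.3}}X_2 + r_{13.2}\frac{\sigma_{1.2}}{\sigma_{3.2}}X_3 + C_1 \\
&= r_{12.3}\frac{\sigma_1\sqrt{1 - r_{13}^2}}{\sigma_2\sqrt{1 - r_{23}^2}}X_2 + r_{13.2}\frac{\sigma_1\sqrt{1 - r_{12}^2}}{\sigma_3\sqrt{1 - r_{23}^2}}X_3 + C_1.
\end{aligned}$$

Upon substituting the values of $r_{12.3}$ and $r_{13.2}$, we obtain 

$$\bar{X}_1 = \frac{\sigma_1}{\sigma_2}\,\frac{(r_{12} - r_{13}r_{23})}{1 - r_{23}^2}X_2 + \frac{\sigma_1}{\sigma_3}\,\frac{(r_{13} - r_{12}r_{23})}{1 - r_{23}^2}X_3 + C_1 \qquad (143)$$

and, similarly, 

$$\bar{X}_2 = \frac{\sigma_2}{\sigma_1}\,\frac{(r_{12} - r_{13}r_{23})}{1 - r_{13}^2}X_1 + \frac{\sigma_2}{\sigma_3}\,\frac{(r_{23} - r_{12}r_{13})}{1 - r_{13}^2}X_3 + C_2 \qquad (144)$$

$$\bar{X}_3 = \frac{\sigma_3}{\sigma_1}\,\frac{(r_{13} - r_{12}r_{23})}{1 - r_{12}^2}X_1 + \frac{\sigma_3}{\sigma_2}\,\frac{(r_{23} - r_{12}r_{13})}{1 - r_{12}^2}X_2 + C_3 \qquad (145)$$
{Regression equations for three variables, short form}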
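The short form for three variables can also be reached directly from the least-squares condition, which may help to show where the factor $1 - r_{23}^2$ comes from. The sketch below assumes that each variable is measured from its mean in units of its own standard deviation; the symbols $\beta_2$ and $\beta_3$ for the coefficients in these units are introduced here for illustration and are not part of the notation of the text.

```latex
% Normal equations for predicting X_1 from X_2 and X_3 in standard measure
\begin{aligned}
\beta_2 + r_{23}\,\beta_3 &= r_{12},\\
r_{23}\,\beta_2 + \beta_3 &= r_{13}.
\end{aligned}
% Solving the pair of equations:
\beta_2 = \frac{r_{12} - r_{13}\,r_{23}}{1 - r_{23}^2},
\qquad
\beta_3 = \frac{r_{13} - r_{12}\,r_{23}}{1 - r_{23}^2}.
% Returning to the original units:
b_{12.3} = \frac{\sigma_1}{\sigma_2}\,\beta_2,
\qquad
b_{13.2} = \frac{\sigma_1}{\sigma_3}\,\beta_3 .
```

These are the two coefficients of the regression of $X_1$ on $X_2$ and $X_3$ in the short form.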


The general expression for C is given by 
$$C = M_1 - b_{12.34\ldots n}M_2 - b_{13.24\ldots n}M_3 - \cdots - b_{1n.23\ldots(n-1)}M_n. \qquad (146)$$


{Constant term in regression equation} 

When an estimate has been made with a regression equation 
it is necessary to know something about the reliability of the 
obtained prediction. The standard error of estimate for predicting 
$X_1$ from $n - 1$ other variables is the standard deviation 
of the residuals given by equation (141) and is interpreted in 
the same way as $\sigma\sqrt{1 - r^2}$ of Chapter IX. For the above 
equations we thus find 


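$$\sigma_{1.23} = \sigma_1\sqrt{1 - r_{12}^2}\,\sqrt{1 - r_{13.2}^2} = \sigma_1\sqrt{1 - r_{13}^2}\,\sqrt{1 - r_{12.3}^2},$$

or 

$$\sigma_{1.23} = \sigma_1\sqrt{\frac{1 - r_{12}^2 - r_{13}^2 - r_{23}^2 + 2r_{12}r_{13}r_{23}}{1 - r_{23}^2}} = \frac{\sigma_1\sqrt{S_{123}}}{\sqrt{1 - r_{23}^2}} \qquad (147)$$

$$\sigma_{2.13} = \sigma_2\sqrt{\frac{1 - r_{12}^2 - r_{13}^2 - r_{23}^2 + 2r_{12}r_{13}r_{23}}{1 - r_{13}^2}} = \frac{\sigma_2\sqrt{S_{123}}}{\sqrt{1 - r_{13}^2}} \qquad (148)$$

$$\sigma_{3.12} = \sigma_3\sqrt{\frac{1 - r_{12}^2 - r_{13}^2 - r_{23}^2 + 2r_{12}r_{13}r_{23}}{1 - r_{12}^2}} = \frac{\sigma_3\sqrt{S_{123}}}{\sqrt{1 - r_{12}^2}} \qquad (149)$$
{Standard deviations of the second-order}

where $S_{123}$ is the sum of the terms in the numerator of (147). 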
The probable errors of estimate are of course obtained by mul- 
tiplying the above values by .6745. 
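The passage from the product form of $\sigma_{1.23}$ to the form involving $S_{123}$ takes only one step of algebra, sketched here for the reader who wishes to verify it; nothing is assumed beyond the first-order formula for $r_{13.2}$.

```latex
% Start from the first-order partial correlation
r_{13.2}^2 = \frac{(r_{13} - r_{12}r_{23})^2}{(1 - r_{12}^2)(1 - r_{23}^2)}
% Multiply (1 - r_{13.2}^2) by (1 - r_{12}^2)
(1 - r_{12}^2)(1 - r_{13.2}^2)
  = \frac{(1 - r_{12}^2)(1 - r_{23}^2) - (r_{13} - r_{12}r_{23})^2}{1 - r_{23}^2}
% Expand; the terms in r_{12}^2 r_{23}^2 cancel
  = \frac{1 - r_{12}^2 - r_{13}^2 - r_{23}^2 + 2\,r_{12}r_{13}r_{23}}{1 - r_{23}^2}
  = \frac{S_{123}}{1 - r_{23}^2}
```

Since $S_{123}$ is symmetrical in the three subscripts, the same numerator serves for all three standard deviations, and only the denominator changes.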
In dealing with a three-variable problem for which only zero-order 
coefficients are required, formulas (143), (144), and (145) 
will be found convenient. When the partial correlations are 
available, however, formulas like (139), (140), and (141) will be 
found much simpler to employ. The computation for the former 
equations will be illustrated by an example in which success in 
first-year college work is predicted by the average of four years’ 
work in high school and an intelligence test. The three variables 
and zero-order coefficients obtained from a sample of 75 cases 
may be given as follows: 

$X_1$ = criterion of success = average mark for first-year col- 
lege work. 

X2 = predictor = average mark from four years in high school. 

X3 = predictor = score on the Brown Intelligence Test. 


$M_1$ = 78.0%,      $\sigma_1$ = 10.21%,      $r_{12}$ = .666; 
$M_2$ = 87.2%,      $\sigma_2$ = 6.02%,       $r_{13}$ = .750; 
$M_3$ = 32.8 pts.,  $\sigma_3$ = 10.35 pts.,  $r_{23}$ = .628. 


It should be noted that the number of cases (N = 75) is too 
small to give very reliable results, but the above example will 
be used to illustrate the calculation. 

By Blakeman’s test the correlations all proved to be linear, 
so that the method of partial correlation is justified in this 
problem. 

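The equation required is (143), or 

$$\bar{X}_1 = \frac{\sigma_1}{\sigma_2}\,\frac{(r_{12} - r_{13}r_{23})}{1 - r_{23}^2}X_2 + \frac{\sigma_1}{\sigma_3}\,\frac{(r_{13} - r_{12}r_{23})}{1 - r_{23}^2}X_3 + C,$$

the computation for which may be arranged as in the table on 
page 296. 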

Inasmuch as the zero-order values are given to three and four 
significant figures, a five-place logarithm table has been used. 
The logarithms of $1 - r^2$ are given directly by Holzinger’s 
Table VI. It will be found necessary to observe the arrange- 
ment of the quantities in the formula very carefully in order to 
combine the proper logarithms. 
TABLE 79. SHOWING CALCULATION OF FIRST-ORDER REGRESSION 
COEFFICIENTS 

        (1)      (2)        (3)           (4)                (5)      (6)         (7)
         r     PRODUCT   DIFFERENCE   LOG (r − rr)            σ      LOG σ    LOG (1 − r²)
                 rr        r − rr
12     .666     .4710      .1950        9.29003        1    10.21   1.00903        —
13     .750     .4182      .3318        9.52088        2     6.02   0.77960        —
23     .628       —          —             —           3    10.35   1.01494     9.78220

        (8)                    (9)                    (10)                  (11)
   LOG NUMERATOR         LOG DENOMINATOR        LOG COEFFICIENT         COEFFICIENT
 [COLS. (4) AND (6)]    [COLS. (6) AND (7)]    [COLS. (8) AND (9)]
      0.29906                0.56180                9.73726           b12.3 = .5461
      0.52991                0.79714                9.73277           b13.2 = .5405

Using a calculator and Miner’s Table for $1 - r^2$, the above 
computation becomes very much easier: 

$$b_{12.3} = \frac{10.21 \times .195}{6.02 \times .605616} = \frac{1.99095}{3.6458} = .5461$$

and 

$$b_{13.2} = \frac{10.21 \times .3318}{10.35 \times .605616} = \frac{3.38768}{6.2681} = .5405.$$

The value of C as given by (146) becomes 

$$C = 78 - .5461 \times 87.2 - .5405 \times 32.8 = 12.65, \text{ or } 12.65\%,$$

the unit being the same as for $X_1$. 

Using formula (147), the probable error of estimate may also 
be worked out from the zero-order coefficients: 

$$S_{123} = 1 - r_{12}^2 - r_{13}^2 - r_{23}^2 + 2r_{12}r_{13}r_{23} = .226932.$$

          log √S123 = 9.67795 
             log σ1 = 1.00903 
          log .6745 = 9.82898 
          log prod. = 0.51596 
    log √(1 − r23²) = 9.89110 
    log .6745 σ1.23 = 0.62486 
           P.E.1.23 = 4.22% 
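The whole three-variable solution can be retraced by machine from the zero-order values. The sketch below uses the means, standard deviations, and correlations given for the 75 cases; the variable names are illustrative, and because no intermediate rounding is done the fourth decimal may differ slightly from the hand computation.

```python
import math

# Zero-order data for the college-success problem (N = 75)
M1, M2, M3 = 78.0, 87.2, 32.8          # means
s1, s2, s3 = 10.21, 6.02, 10.35        # standard deviations
r12, r13, r23 = 0.666, 0.750, 0.628    # correlations

# Regression coefficients, short form (143)
b12_3 = (s1 / s2) * (r12 - r13 * r23) / (1 - r23 ** 2)
b13_2 = (s1 / s3) * (r13 - r12 * r23) / (1 - r23 ** 2)

# Constant term, formula (146)
C = M1 - b12_3 * M2 - b13_2 * M3

# Standard and probable error of estimate, formulas (147) and (142)
S123 = 1 - r12 ** 2 - r13 ** 2 - r23 ** 2 + 2 * r12 * r13 * r23
sigma_1_23 = s1 * math.sqrt(S123 / (1 - r23 ** 2))
pe_est = 0.6745 * sigma_1_23

print(round(b12_3, 4), round(b13_2, 4), round(C, 2), round(pe_est, 2))
```

The small differences from the tabled values arise only from the rounding of the products in Table 79.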
The complete regression equation therefore becomes 

$$\bar{X}_1 = .546\,X_2 + .540\,X_3 + 12.65\% \pm 4.22\%.$$


It should be noted that the coefficients .546 and .540 may not 
be compared directly, but that each gives the average change 
in $X_1$ for a unit change in $X_2$ and $X_3$, respectively, when the 
other variable is held constant. Thus an increase of 1 per 
cent in the high-school record is accompanied on the average 
by an increase of .546 of 1 per cent in the college record, while 
an increase of one point on the Brown test is accompanied by 
an increase of .540 of 1 per cent in college standing. 

In making a prediction with the above equation it is only 
necessary to substitute values for $X_2$ and $X_3$. A student, for 
example, may enter the University with a high-school average 
of 80 and a Brown test score of 40. Upon substituting these 
values in this last equation the most probable standing of the 
student in college at the end of the freshman year will be given 
by 77.93 ± 4.22. It is therefore an even chance that his college 
rating will be anywhere from 73.71 to 82.15, and the importance 
of the probable error of estimate is seen in placing a reservation 
upon the accuracy of the prediction. For a second student with 
a high-school average of 90 and a Brown score of 50 we find, 
similarly, $\bar{X}_1 = 88.79 \pm 4.22$. Here it is an even chance that this 
student’s college average will be between 84.57 and 93.01. 
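The two forecasts may be reproduced with a short routine. The function names are illustrative only; the coefficients, constant, and probable error of estimate are those of the completed regression equation.

```python
PE_EST = 4.22  # probable error of estimate, in per cent

def predict_college_mark(high_school_avg, brown_score):
    """Most probable first-year college average from the regression equation."""
    return 0.546 * high_school_avg + 0.540 * brown_score + 12.65

def even_chance_interval(estimate, pe=PE_EST):
    """Limits within which the true mark falls with a probability of one half."""
    return (round(estimate - pe, 2), round(estimate + pe, 2))

first = predict_college_mark(80, 40)
second = predict_college_mark(90, 50)

print(round(first, 2), even_chance_interval(first))
print(round(second, 2), even_chance_interval(second))
```

The width of the interval, 8.44 per cent, is the same for every student, since the probable error of estimate belongs to the equation and not to the individual.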

The question sometimes arises, Why predict the college stand- 
ing of students when it is already known in this problem? The 
standing of only the sample observed is known, however. This 
criterion is used as a basis for determining the regression equa- 
tion by means of which predictions may be made with similar 
groups for which the college standing is unknown. It is assumed, 
therefore, that other groups will possess the same characteristics 
as the sample studied so that the equation may also be applied 
to them. Needless to say, this assumption is often not fulfilled, 
but the forecast by means of the regression equation is one of 
the best that can be made on the basis of past experience. 
4. SOME CAUTIONS IN THE USE OF REGRESSION EQUATIONS 


Estimates by means of regression equations may fail to be 
reliable for several reasons : 

a. The trend of the data in the observed sample may be im- 
perfectly represented by a linear function. By testing all of the 
zero-order regressions for linearity this objection may be over- 
come. 

b. Data to which the regression equation is applied may not 
be comparable with those of the sample from which the equa- 
tion was derived. This difficulty may be illustrated by the case 
of a high school with unusually low or high standards of mark- 
ing. The use of the equation in the last problem in such a case 
would give misleading results, since the data were obtained from 
a normal group. It would probably be necessary to work out a 
separate equation for such schools. 

c. The correlations of various orders may be so small that the 
probable error of estimate becomes relatively large. If this con- 
dition prevails predictors having higher correlation with the 
criterion must be sought, or their number must be increased, as 
is evident from inspection of formula (141). 

d. The number of cases in the sample furnishing the predicting 
equation may be so small that the regression coefficients are 
unstable. The probable error of $b_{12.3} = .546$ in the last problem 
is given by formula (97) of Chapter XIII, that is, 

$$\text{P.E.}_{b_{12.3}} = .6745\,\frac{\sigma_{1.23}}{\sigma_{2.3}\sqrt{N}} = .6745\,\frac{\sigma_1\sqrt{S_{123}}}{\sigma_2(1 - r_{23}^2)\sqrt{N}} = .104,$$

or $b_{12.3} = .546 \pm .104$. 

The value of the regression coefficients based on a very large 
number of cases might therefore differ considerably from the 
values actually found in a small sample of 75, which was used 
here chiefly for numerical illustration. 
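The dependence of this probable error upon the number of cases may be seen by repeating the computation for a larger sample. In the sketch below the figure of 750 cases is purely hypothetical, and the correlations and standard deviations are assumed to remain what they were for the 75 cases.

```python
import math

s1, s2 = 10.21, 6.02
r12, r13, r23 = 0.666, 0.750, 0.628
S123 = 1 - r12 ** 2 - r13 ** 2 - r23 ** 2 + 2 * r12 * r13 * r23

def pe_b12_3(n):
    """Probable error of b12.3 for a sample of n cases."""
    return 0.6745 * s1 * math.sqrt(S123) / (s2 * (1 - r23 ** 2) * math.sqrt(n))

pe_75 = pe_b12_3(75)     # the sample actually used
pe_750 = pe_b12_3(750)   # hypothetical sample ten times as large

print(round(pe_75, 3), round(pe_750, 3))
```

Since the probable error varies inversely as the square root of N, ten times as many cases reduce it only to about one third of its former value.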
The above difficulties may be illustrated by an example taken 
from a study by Dr. W. R. Burgess.* The predicted variable 
was years of teacher training beyond high school, regression 
equations for which were formed with time as the predicting 
factor. In Fig. 69 the data and regression line for one state 
are shown. By means of the latter, Dr. Burgess predicted that 
in 1950 the average teacher in Montana would have 1.36 years 
of training beyond high school. 

It should be observed, however, that the data do not furnish 
a linear trend for the period studied, and the regression line 
is therefore a bad fit. Furthermore, the forecast has been 
obtained from a ten-year period and has been projected forty 
years beyond the range of observation. The assumption that 
the educational conditions in Montana from 1920 to 1950 will 
be comparable with those from 1910 to 1920 is unwarranted 
and the prediction is therefore probably worthless. 

FIG. 69. Regression line for Burgess data 

The correlation between teacher-preparation index and time 
is necessarily small, thus giving a relatively large error in esti- 
mate, and finally, although the total number of observations is 
large, the probable error of a regression coefficient as small as 
.006 is such as to render its value of doubtful significance. 

If predictions of the above type are to be made, the trend for 
the data studied must be approximately linear and the projec- 
tion made only a short time beyond the range of observation. 


* W. R. Burgess, “Trends of Teacher Preparation,’ Journal of Educational 
Research, October, 1921, p. 181. 
5. PARTIAL REGRESSION EQUATIONS FOR FOUR VARIABLES 


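When four variables are involved, the regression equation 
for predicting $X_1$ from $X_2$, $X_3$, and $X_4$ may be obtained from 
formula (139) and written in the form 

$$\bar{X}_1 = r_{12.34}\frac{\sigma_{1.34}}{\sigma_{2.34}}X_2 + r_{13.24}\frac{\sigma_{1.24}}{\sigma_{3.24}}X_3 + r_{14.23}\frac{\sigma_{1.23}}{\sigma_{4.23}}X_4 + C.$$

The standard deviations are obtained from (141), giving 

$$\begin{aligned}
\bar{X}_1 = {} & r_{12.34}\,\frac{\sigma_1\sqrt{1 - r_{13}^2}\,\sqrt{1 - r_{14.3}^2}}{\sigma_2\sqrt{1 - r_{23}^2}\,\sqrt{1 - r_{24.3}^2}}\,X_2 \\
& + r_{13.24}\,\frac{\sigma_1\sqrt{1 - r_{12}^2}\,\sqrt{1 - r_{14.2}^2}}{\sigma_3\sqrt{1 - r_{23}^2}\,\sqrt{1 - r_{34.2}^2}}\,X_3 \\
& + r_{14.23}\,\frac{\sigma_1\sqrt{1 - r_{12}^2}\,\sqrt{1 - r_{13.2}^2}}{\sigma_4\sqrt{1 - r_{24}^2}\,\sqrt{1 - r_{34.2}^2}}\,X_4 + C. \qquad (150)
\end{aligned}$$
{Regression equation in four variables in terms of partial correlations}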

In order to calculate the regression coefficients for equation 
(150) it is first necessary to compute the required partial-corre- 
lation coefficients of first-order and second-order and then sub- 
stitute in the above expression for the regression coefficients. 
The value of the constant term C is then readily determined 
from formula (146). 

This procedure may be the easiest and most direct, especially 
when the partial correlations are needed for other purposes. 
Another method will next be presented, however, because stu- 
dents often find it very convenient. The new formulas have 
two advantages over (150) : all the operations involved are fully 
expressed, and nothing but zero-order correlation coefficients 
and standard deviations are required in the calculation. It is 
therefore only necessary to make straightforward substitutions 
of these values. Since the formulas for the correlation and re- 
gression coefficients involve the same expressions, we may begin 
with the former. 
Returning first to the general formula (135) for partial correlation, 
we may write 

$$r_{12.34} = \frac{r_{12.3} - r_{14.3}r_{24.3}}{\sqrt{(1 - r_{14.3}^2)(1 - r_{24.3}^2)}}.$$

Upon substituting the values for the coefficients of first-order 
in this expression, we find 

$$r_{12.34} = \frac{r_{12}(1 - r_{34}^2) - r_{13}r_{23} - r_{14}r_{24} + r_{34}(r_{13}r_{24} + r_{14}r_{23})}{\sqrt{(1 - r_{13}^2 - r_{14}^2 - r_{34}^2 + 2r_{13}r_{14}r_{34})(1 - r_{23}^2 - r_{24}^2 - r_{34}^2 + 2r_{23}r_{24}r_{34})}} = \frac{S_{12.34}}{\sqrt{S_{134}S_{234}}} \qquad (151a)$$

and, similarly, 

$$r_{13.24} = \frac{r_{13}(1 - r_{24}^2) - r_{12}r_{23} - r_{14}r_{34} + r_{24}(r_{12}r_{34} + r_{14}r_{23})}{\sqrt{(1 - r_{12}^2 - r_{14}^2 - r_{24}^2 + 2r_{12}r_{14}r_{24})(1 - r_{23}^2 - r_{24}^2 - r_{34}^2 + 2r_{23}r_{24}r_{34})}} = \frac{S_{13.24}}{\sqrt{S_{124}S_{234}}} \qquad (151b)$$

and 

$$r_{14.23} = \frac{r_{14}(1 - r_{23}^2) - r_{12}r_{24} - r_{13}r_{34} + r_{23}(r_{12}r_{34} + r_{13}r_{24})}{\sqrt{(1 - r_{12}^2 - r_{13}^2 - r_{23}^2 + 2r_{12}r_{13}r_{23})(1 - r_{23}^2 - r_{24}^2 - r_{34}^2 + 2r_{23}r_{24}r_{34})}} = \frac{S_{14.23}}{\sqrt{S_{123}S_{234}}} \qquad (151c)$$
{Second-order correlation coefficients in terms of zero-order coefficients}

where $S_{12.34} = r_{12}(1 - r_{34}^2) - r_{13}r_{23} - r_{14}r_{24} + r_{34}(r_{13}r_{24} + r_{14}r_{23})$ 
and $S_{134} = 1 - r_{13}^2 - r_{14}^2 - r_{34}^2 + 2r_{13}r_{14}r_{34}$, etc., as used in formula 
(147). It will be noted that seven different expressions of 
the form indicated by $S_{12.34}$ and $S_{134}$ are required for the 
computation of the three correlation coefficients. 
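The seven quantities and the three coefficients they yield may be sketched as follows, using Burt’s zero-order correlations of Table 76 for illustration. The function names `S3` and `S_pair` are hypothetical labels for the two forms $S_{134}$ and $S_{12.34}$.

```python
import math

# Burt's zero-order correlations (Table 76), keyed by (smaller, larger) subscript
r = {(1, 2): 0.91, (1, 3): 0.84, (1, 4): 0.83,
     (2, 3): 0.75, (2, 4): 0.87, (3, 4): 0.70}

def z(i, j):
    """Zero-order correlation between variables i and j."""
    return r[(min(i, j), max(i, j))]

def S3(i, j, k):
    """Quantity of the form S_ijk = 1 - r_ij^2 - r_ik^2 - r_jk^2 + 2 r_ij r_ik r_jk."""
    return (1 - z(i, j) ** 2 - z(i, k) ** 2 - z(j, k) ** 2
            + 2 * z(i, j) * z(i, k) * z(j, k))

def S_pair(i, j, k, l):
    """Quantity of the form S_ij.kl, the numerator of r_ij.kl."""
    return (z(i, j) * (1 - z(k, l) ** 2) - z(i, k) * z(j, k) - z(i, l) * z(j, l)
            + z(k, l) * (z(i, k) * z(j, l) + z(i, l) * z(j, k)))

def r_second_order(i, j, k, l):
    """Second-order coefficient r_ij.kl directly from zero-order values."""
    return S_pair(i, j, k, l) / math.sqrt(S3(i, k, l) * S3(j, k, l))

r12_34 = r_second_order(1, 2, 3, 4)
r13_24 = r_second_order(1, 3, 2, 4)
r14_23 = r_second_order(1, 4, 2, 3)

print(round(r12_34, 3), round(r13_24, 3), round(r14_23, 3))
```

No first-order coefficient is formed at any stage, which is the advantage claimed for this form of the calculation.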

Similar expressions for the partial regression coefficients are 
next obtained by the use of a general reduction formula, 

$$b_{12.34\ldots n} = \frac{r_{12.34\ldots(n-1)} - r_{1n.34\ldots(n-1)}\,r_{2n.34\ldots(n-1)}}{1 - r_{2n.34\ldots(n-1)}^2}\cdot\frac{\sigma_{1.34\ldots(n-1)}}{\sigma_{2.34\ldots(n-1)}}. \qquad (152)$$
{Reduction formula for regression coefficient}

Applying this formula and making use of (141), we find 

$$b_{12.34} = \frac{\sigma_1}{\sigma_2}\left[\frac{r_{12.3} - r_{14.3}r_{24.3}}{1 - r_{24.3}^2}\right]\frac{\sqrt{1 - r_{13}^2}}{\sqrt{1 - r_{23}^2}},$$

and upon substituting the values for first-order correlations, 
there results 

$$b_{12.34} = \frac{\sigma_1}{\sigma_2}\left[\frac{r_{12}(1 - r_{34}^2) - r_{13}r_{23} - r_{14}r_{24} + r_{34}(r_{13}r_{24} + r_{14}r_{23})}{1 - r_{23}^2 - r_{24}^2 - r_{34}^2 + 2r_{23}r_{24}r_{34}}\right] = \frac{\sigma_1 S_{12.34}}{\sigma_2 S_{234}} \qquad (153a)$$

and, similarly, 

$$b_{13.24} = \frac{\sigma_1}{\sigma_3}\left[\frac{r_{13}(1 - r_{24}^2) - r_{12}r_{23} - r_{14}r_{34} + r_{24}(r_{12}r_{34} + r_{14}r_{23})}{1 - r_{23}^2 - r_{24}^2 - r_{34}^2 + 2r_{23}r_{24}r_{34}}\right] = \frac{\sigma_1 S_{13.24}}{\sigma_3 S_{234}} \qquad (153b)$$

and 

$$b_{14.23} = \frac{\sigma_1}{\sigma_4}\left[\frac{r_{14}(1 - r_{23}^2) - r_{12}r_{24} - r_{13}r_{34} + r_{23}(r_{12}r_{34} + r_{13}r_{24})}{1 - r_{23}^2 - r_{24}^2 - r_{34}^2 + 2r_{23}r_{24}r_{34}}\right] = \frac{\sigma_1 S_{14.23}}{\sigma_4 S_{234}} \qquad (153c)$$
{Second-order regression coefficients in terms of zero-order coefficients}


The advantage of these last equations becomes apparent from 
the fact that only four different quantities, of the form $S_{12.34}$ and $S_{234}$, are 
required for the complete solution of a given regression equation. 
The constant term C is of course given by 

$$C = M_1 - b_{12.34}M_2 - b_{13.24}M_3 - b_{14.23}M_4.$$

The standard error of estimate may be written

$$\sigma_{1.234} = \sigma_1 \sqrt{(1 - r_{12}^2)(1 - r_{13.2}^2)(1 - r_{14.23}^2)}$$

by the use of equation (141). Upon substituting the value for
$r_{14.23}$ from (151c) and expressing $r_{13.2}$ in terms of zero-order
coefficients, there results, after simplification,

$$\sigma_{1.234} = \sigma_1 \sqrt{\frac{S_{123} S_{234} - S_{14.23}^2}{(1 - r_{23}^2)\, S_{234}}} \tag{154}$$

{Standard deviation of third-order in terms of zero-order coefficients}

and by permuting the subscripts,

$$\sigma_{2.134} = \sigma_2 \sqrt{\frac{S_{123} S_{134} - S_{24.13}^2}{(1 - r_{13}^2)\, S_{134}}}, \text{ etc.}$$
The complete solution of the regression equation in four variables,
together with the partial correlation coefficients, is thus
accomplished by calculating seven quantities, of the forms $S_{12.34}$
and $S_{134}$, based upon zero-order values.

As an illustrative example we may take a four-variable 
problem worked out by Mr. J. W. Hoge in a term paper. The 
problem was to predict success in plane geometry from alge- 
braic ability, arithmetical ability, and intelligence. The group 
consisted of fifty high-school sophomores. 

A list of the variables used may be given as follows: 

X1 = criterion = cumulative score on eight units of work in 
plane geometry covering a six months’ period. 

X2= the cumulative score on three algebra tests covering 
the four fundamental operations and the solution of linear and 
quadratic equations. 

X3 = score on the Reavis-Breslich arithmetic test. 

X4 = intelligence quotient on the Otis Self-Administering test. 

The zero-order correlation coefficients, standard deviations, 
and means are given in the following table: 


TABLE 80. DATA FROM MR. HOGE’S PAPER


ZERO-ORDER          STANDARD             MEANS               PARTIAL
CORRELATIONS        DEVIATIONS                               CORRELATIONS
$r_{12} = .54$      $\sigma_1 = 35.5$    $M_1 = 224.4$
$r_{13} = .49$      $\sigma_2 = 6.87$    $M_2 = 41.32$       $r_{13.2} = .258$
$r_{14} = .41$      $\sigma_3 = 21.28$   $M_3 = 81.52$       $r_{14.23} = .234$
$r_{23} = .58$      $\sigma_4 = 8.49$    $M_4 = 113.88$
$r_{24} = .29$
$r_{34} = .50$


Returning to equations (153), we shall first work out the
regression coefficients for the equation

$$X_1 = b_{12.34} X_2 + b_{13.24} X_3 + b_{14.23} X_4 + C.$$

Upon substituting the zero-order correlation coefficients we find
$S_{12.34} = .192$, $S_{13.24} = .0779$, $S_{14.23} = .1095$, and $S_{234} = .498$.
The values for the regression coefficients may then be written

$$b_{12.34} = \frac{35.5}{6.87} \times \frac{.192}{.498} = 1.99, \qquad b_{13.24} = \frac{35.5}{21.28} \times \frac{.0779}{.498} = .261,$$

$$b_{14.23} = \frac{35.5}{8.49} \times \frac{.1095}{.498} = .919.$$

The equation for the constant term $C$ then gives

$$C = 224.4 - 1.99 \times 41.32 - .261 \times 81.52 - .919 \times 113.88 = 16.2,$$

and the complete regression equation is thus written,

$$X_1 = 1.99 X_2 + .261 X_3 + .919 X_4 + 16.2.$$
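
The arithmetic of formulas (153) and of the constant term may be checked mechanically. The sketch below is an illustration in Python and is not part of the original text; the helper names are hypothetical, and the data are those of Table 80. Because it carries unrounded values of the $S$ quantities, its results may differ from the printed figures in the last place.

```python
# Data of Table 80: zero-order correlations, standard deviations, and means.
R = {(1, 2): .54, (1, 3): .49, (1, 4): .41,
     (2, 3): .58, (2, 4): .29, (3, 4): .50}
SD = {1: 35.5, 2: 6.87, 3: 21.28, 4: 8.49}
M = {1: 224.4, 2: 41.32, 3: 81.52, 4: 113.88}

def r(i, j):
    """Zero-order correlation r_ij (symmetric lookup)."""
    return R[(min(i, j), max(i, j))]

def s3(i, j, k):
    """S_ijk = 1 - r_ij^2 - r_ik^2 - r_jk^2 + 2 r_ij r_ik r_jk."""
    return (1 - r(i, j) ** 2 - r(i, k) ** 2 - r(j, k) ** 2
            + 2 * r(i, j) * r(i, k) * r(j, k))

def s4(i, j, k, l):
    """S_ij.kl = r_ij(1 - r_kl^2) - r_ik r_jk - r_il r_jl + r_kl(r_ik r_jl + r_il r_jk)."""
    return (r(i, j) * (1 - r(k, l) ** 2) - r(i, k) * r(j, k) - r(i, l) * r(j, l)
            + r(k, l) * (r(i, k) * r(j, l) + r(i, l) * r(j, k)))

def b(i, j, k, l):
    """Second-order regression coefficient b_ij.kl = (sd_i / sd_j) * S_ij.kl / S_jkl."""
    return SD[i] / SD[j] * s4(i, j, k, l) / s3(j, k, l)

b12_34 = b(1, 2, 3, 4)   # formula (153a)
b13_24 = b(1, 3, 2, 4)   # formula (153b)
b14_23 = b(1, 4, 2, 3)   # formula (153c)
C = M[1] - b12_34 * M[2] - b13_24 * M[3] - b14_23 * M[4]
print(f"X1 = {b12_34:.2f} X2 + {b13_24:.3f} X3 + {b14_23:.3f} X4 + {C:.1f}")
```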


In order to obtain the standard error of estimate $\sigma_{1.234}$, the
quantity $S_{123}$ is required, and the latter will be worked out in
computing the partial correlations as follows:

$$S_{123} = .439, \quad S_{124} = .585, \quad S_{134} = .543.$$

Substituting the necessary values in equations (151), we find

$$r_{12.34} = \frac{.192}{\sqrt{.543 \times .498}} = .369, \qquad r_{13.24} = \frac{.0779}{\sqrt{.585 \times .498}} = .144,$$

and

$$r_{14.23} = \frac{.1095}{\sqrt{.439 \times .498}} = .234.$$

The value for $\sigma_{1.234}$ from formula (154) becomes

$$\sigma_{1.234} = 35.5 \sqrt{\frac{.218622 - .011990}{.6636 \times .498}} = 28.1,$$

so that the probable error of estimate is $.6745 \times 28.1 = 19.0$.
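
As a check on formula (154), the same standard error may be reached through the product form based on equation (141). The following Python sketch is an illustration only, not the author's procedure; the names are hypothetical and the data are those of Table 80.

```python
from math import sqrt

R = {(1, 2): .54, (1, 3): .49, (1, 4): .41,
     (2, 3): .58, (2, 4): .29, (3, 4): .50}
SD1 = 35.5  # standard deviation of the criterion X1

def r(i, j):
    """Zero-order correlation r_ij (symmetric lookup)."""
    return R[(min(i, j), max(i, j))]

def s3(i, j, k):
    """S_ijk = 1 - r_ij^2 - r_ik^2 - r_jk^2 + 2 r_ij r_ik r_jk."""
    return (1 - r(i, j) ** 2 - r(i, k) ** 2 - r(j, k) ** 2
            + 2 * r(i, j) * r(i, k) * r(j, k))

def s4(i, j, k, l):
    """S_ij.kl = r_ij(1 - r_kl^2) - r_ik r_jk - r_il r_jl + r_kl(r_ik r_jl + r_il r_jk)."""
    return (r(i, j) * (1 - r(k, l) ** 2) - r(i, k) * r(j, k) - r(i, l) * r(j, l)
            + r(k, l) * (r(i, k) * r(j, l) + r(i, l) * r(j, k)))

# Formula (154): standard error from zero-order quantities only.
sigma_154 = SD1 * sqrt((s3(1, 2, 3) * s3(2, 3, 4) - s4(1, 4, 2, 3) ** 2)
                       / ((1 - r(2, 3) ** 2) * s3(2, 3, 4)))

# Product form: sigma_1 * sqrt((1 - r12^2)(1 - r13.2^2)(1 - r14.23^2)).
r13_2 = (r(1, 3) - r(1, 2) * r(2, 3)) / sqrt((1 - r(1, 2) ** 2) * (1 - r(2, 3) ** 2))
r14_23 = s4(1, 4, 2, 3) / sqrt(s3(1, 2, 3) * s3(2, 3, 4))
sigma_141 = SD1 * sqrt((1 - r(1, 2) ** 2) * (1 - r13_2 ** 2) * (1 - r14_23 ** 2))

probable_error = 0.6745 * sigma_154
print(f"{sigma_154:.1f} {sigma_141:.1f} {probable_error:.1f}")
```

The two routes agree to machine precision, which is the algebraic content of the "simplification" mentioned in the text.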

This large error of estimate is due to the wide range (134-276) 
and large standard deviation (35.5) of the cumulative geometry 
scores as well as to the low intercorrelation of the tests. 

By dropping intelligence as a predictor the regression equation
becomes $X_1 = 1.99 X_2 + .444 X_3 + 106.0 \pm 19.5$, with only
a slightly larger error of estimate. The estimate from three
variables is thus more reliable than the estimate from two
variables, but on account of the small difference it is hardly
worth while using more than the two predicting variables in
such a problem. Correlations of the order .5 between criterion
and predictor are usually necessary before additional variables
increase the reliability of the estimate to any appreciable extent.
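
The two-predictor equation quoted above can be reproduced from the Table 80 values. The sketch below is a Python illustration, not part of the original text; it assumes the usual first-order expressions $b_{12.3} = (\sigma_1/\sigma_2)(r_{12} - r_{13} r_{23})/(1 - r_{23}^2)$ and its companion for $b_{13.2}$, and all variable names are hypothetical.

```python
from math import sqrt

# Zero-order correlations, standard deviations, and means for X1, X2, X3 (Table 80).
r12, r13, r23 = .54, .49, .58
sd1, sd2, sd3 = 35.5, 6.87, 21.28
m1, m2, m3 = 224.4, 41.32, 81.52

# First-order regression coefficients (intelligence, X4, dropped as a predictor).
b12_3 = sd1 / sd2 * (r12 - r13 * r23) / (1 - r23 ** 2)
b13_2 = sd1 / sd3 * (r13 - r12 * r23) / (1 - r23 ** 2)
c = m1 - b12_3 * m2 - b13_2 * m3

# Standard error of estimate and probable error with two predictors.
r13_2 = (r13 - r12 * r23) / sqrt((1 - r12 ** 2) * (1 - r23 ** 2))
sigma_1_23 = sd1 * sqrt((1 - r12 ** 2) * (1 - r13_2 ** 2))
probable_error = 0.6745 * sigma_1_23

print(f"X1 = {b12_3:.2f} X2 + {b13_2:.3f} X3 + {c:.1f} +/- {probable_error:.1f}")
```

With these data the weight on $X_3$ changes from about .44 to about .26 when $X_4$ is added, while the weight on $X_2$ is unchanged because $r_{24} - r_{23} r_{34} = 0$.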
In order to assist the student in working out any regression
coefficients by the above method with four variables, a
complete set of values is given in Table 81.


TABLE 81. REGRESSION COEFFICIENTS OF SECOND-ORDER EXPRESSED 
IN TERMS OF ZERO-ORDER COEFFICIENTS 


ier O1 E 12(1 — r342) — rigr23 — riaf2oa + rga(risrea + riarae)) O01 Si2.34 
C2 1 — ro32 — rea? — rg42 + 2 rosrearsza 02 S234 
TA O1 Es — rea?) — riere3 — riar34 + rea(riersa + ste) ougs S13.24 
03 1 — re32 — roa? — 342 + 2 resrearsa 03 S234 
pap ee O1 1 [Aa — 7237) — rier24 — 113734 + r23(rier34 + Trafee)) GL $14.23 
O4 1 — re32 — roa? — 7342 + 2 resrear34 04 S234 
b _ G2 [ri2(1 — 134?) — resris — rearia + raa(reari4 + aan" _ 02 812.34 
ag E 1 — 1132 — ri42 — 7342 + 2 risriar34 ~ 61 Sisza 
b _ G2 [7re3(1 — ria?) — rieris — rearas + ria(rigrsa + os _ 02 823.14 
23.14 = [ VT ae Se a a epee ~ 03 S134 
b __ G2 frea(1 — ris?) — rieria — roaraa + 713(riaraa + fats) _ 62 S24.13 
ean TR oe [ 1 — 7132 — r142 — r342 + 2 risriar34 ~ G4 Size 
_ O3[ri3(1 — rea?) — rearig — raaria + rea(rearia + | _ 03 813.24 

bai.24 = O1 [ 1 — rie? — ria? — roa? + 2 rioriaros Gx WSis4 
_ 03 [re3(1 — ria?) — risriz — raares + ri4(risr24 + a) _ 03 823.14 

632.14 = ae al 1 cA) See SO mea ~ 02 Si24 
O3 [rsa(1 — rie?) — risria — ro3r24 + r12(Tisr24 + nue _ 03 834.12 

b34.12 = = al Val =e? Say Sa evo: ~ o4 Siz4 
Oafria(1 — rea?) — reari2 — raaris + rea(rearia + zea) _ 04 814.23 

641.23 = = al il Re | ee Sep CD ma ~ 01 S123 
O4[rea(1 — 1132) — riari2 — r3are3 + 1713(r14723 + ees _ 64 824.13 

b42.13 = = al 1 — rio? — 1132 — res? + 2 riarisres3 ~ 62 S123 
Osfrsa(1 — ri22) — riari3 — reates + ri2(riare3 + ae 04 834.12 

b43.12 = a Al T= Me =e = Teed > Dees o3 Sis 


Another example of four-variable regression is furnished by
Burt’s data in section 2. The equation for estimating the Binet
score from the remaining scores is given by Burt* in the form

Binet = .54 school work + .33 intelligence (reasoning) + .11 age,

the variables being taken from the mean of the whole set and
all expressed in age units. A year’s increase in educational age
is therefore accompanied on the average by .54 of a year of
increase in mental age, and a year’s increase in intelligence by
.33 of a year in mental age, etc. Since all the variables are in
the same unit and the total of the coefficients happens to be
almost unity, Burt makes the following interpretation: “Of the
gross result, then, one ninth is attributable to age, one third to
intellectual development and over half to school attainment,
. . . or in determining the child’s performance on the Binet-Simon
scale, intelligence can bestow but little more than half
the share of school, and age but one third the share of
intelligence” (op. cit. p. 183).

* Op. cit. p. 183.

These results have been seized upon by the antagonists of 
intelligence tests as showing that the Binet scale measures 
chiefly school work and not intelligence, as already noted in 
section 2; but the difficulties involved in such interpretation 
become apparent when other equations such as that for pre- 
dicting age are given. Thus, age =.15 Binet +.51 school 
work + .03 intelligence.* Are we to conclude from this result 
that over half a child’s age is “attributable” to school work, 
one sixth to Binet, and only a small fraction to intelligence? 
Such a conclusion is absurd, but it is logically as sound as 
Burt’s inference regarding his equation. 

* For the derivation of this equation and other critical comment see Holzinger
and Freeman, “The Interpretation of Burt’s Regression Equation,” Journal of
Educational Psychology, December, 1925. For further discussion see also G. H. Thomson,
“The Interpretation of Burt’s Regression Equation,” and Holzinger and Freeman,
“Rejoinder,” Journal of Educational Psychology, May and September, 1926.

Regression coefficients of any order merely show the average
change in the dependent variable for a unit change in the
independent variable to which they are attached, the remaining
variables being constant. If these coefficients are obtained for
a set of variables all in the same units the relative value of
the several predictors may be compared as they affect the
estimate. Thus “school work” is five thirds as valuable as
“intelligence” in forecasting mental age, and five times as valuable as
chronological age used for the same purpose. This is quite
a different interpretation, however, from Burt’s, in which
regression coefficients are regarded as representing the parts of
the independent variables going to make up the total of the
dependent variable.

It must be kept in mind that ordinarily only a few of the 
possible predicting variables are used in an estimate; also, that 
the regression coefficients will change whenever a new variable 
is added, provided the partial correlations involved are not 
zero. This may be illustrated by beginning with the equation 
for Binet scores on school work, which is approximately of the 
form: Binet = .9 school work. According to Burt’s reasoning, 
nine tenths of Binet will then be “attributed to school work.” 
The addition of new variables, however, will reduce this share 
to .7, then to .54, and much lower if enough predictors are 
taken which have some partial correlation with Binet. 


6. MULTIPLE CORRELATION 


The correlation between a set of observed values such as
given by $X_1$ and the predicted values from the regression
equation

$$\bar{X}_1 = b_{12.34\ldots n} X_2 + b_{13.24\ldots n} X_3 + \cdots + b_{1n.23\ldots(n-1)} X_n + C$$

is known as multiple correlation and is denoted by $R_{1(234\ldots n)}$.
It may be shown* that

$$R_{1(23\ldots n)} = r_{X_1 \bar{X}_1} = \sqrt{1 - \frac{\sigma_{1.23\ldots n}^2}{\sigma_1^2}}, \tag{155}$$

{Multiple-correlation coefficient}

but a more convenient form for calculation is given by

$$1 - R_{1(23\ldots n)}^2 = (1 - r_{12}^2)(1 - r_{13.2}^2)(1 - r_{14.23}^2) \cdots (1 - r_{1n.23\ldots(n-1)}^2), \tag{156}$$

{Computation form for $R$}

which follows at once from equation (141).


* Yule, op. cit. p. 248.
One use of multiple correlation is in showing how closely $X_1$
can be expressed as a linear function of $X_2, X_3, \ldots, X_n$. If $X_1$
coincides with the predicted $\bar{X}_1$ for all the observations, the
standard error of estimate becomes zero and $R_{1(23\ldots n)}$ by
formula (155) will equal unity. If, on the other hand, the
residuals $X_1 - \bar{X}_1$ are so large that their standard deviation
$\sigma_{1.23\ldots n}$ approaches $\sigma_1$, the value of $R_{1(23\ldots n)}$ will approach
zero. The multiple-correlation coefficient thus gives an
alternative method for determining the reliability of an estimate
from a regression equation.

In order to illustrate the use of these formulas, we may return
to the data given in section 5 for predicting success in
geometry. Considering that $X_1$ is to be estimated from $X_2$ and $X_3$,
we may substitute $r_{12} = .54$ and $r_{13.2} = .258$ in equation (156),
giving

$$1 - R_{1(23)}^2 = [1 - (.54)^2][1 - (.258)^2].$$

The calculation is very easily done with the aid of Holzinger’s
Table VI, thus:

log [1 − (.54)²]       = 9.85028
log [1 − (.258)²]      = 9.97008
log [1 − R²_{1(23)}]   = 9.82036
∴ R_{1(23)} = .582.

It is only necessary to add the logarithms and look up the value
for $R$ corresponding to their sum; for example, for the logarithm
9.82038 in Table VI we obtain $R = .582$, the answer being correct
to three places.

In estimating $X_1$ from $X_2$, $X_3$, and $X_4$ the equation will be
$1 - R_{1(234)}^2 = (1 - r_{12}^2)(1 - r_{13.2}^2)(1 - r_{14.23}^2)$. The necessary
arithmetic is therefore

log [1 − (.54)²]        = 9.85028
log [1 − (.258)²]       = 9.97008
log [1 − (.234)²]       = 9.97554
log [1 − R²_{1(234)}]   = 9.79590
∴ R_{1(234)} = .612.
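
The logarithmic scheme above can be imitated without tables. The following Python sketch is an illustration only, not part of the original text; the function name `multiple_r` is hypothetical, and the partial correlations are the rounded values from Table 80.

```python
from math import log10, sqrt

def multiple_r(correlations):
    """Multiple correlation from formula (156).

    `correlations` holds r12, r13.2, r14.23, ... in order; the logarithms of
    the factors (1 - r^2) are added, as in the tabular scheme of the text.
    """
    log_sum = sum(log10(1 - c ** 2) for c in correlations)  # log of (1 - R^2)
    return sqrt(1 - 10 ** log_sum)

R_1_23 = multiple_r([.54, .258])          # two predictors
R_1_234 = multiple_r([.54, .258, .234])   # three predictors

# The text writes negative logarithms with 10 added, e.g. 9.85028 for log 0.7084.
log_first_factor = log10(1 - .54 ** 2) + 10
print(f"{log_first_factor:.5f} {R_1_23:.3f} {R_1_234:.3f}")
```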




When $\sigma_{1.234}$ has already been computed by formula (154), 
it is of course necessary only to substitute this result in for- 
mula (155). 

The regression coefficients are the best possible weights which
can be assigned to the variables $X_2, X_3, \ldots, X_n$ in making a
linear prediction for $X_1$. The multiple-correlation coefficient,
therefore, gives a useful measure of the correlation which can 
be expected from pooling the predictive tests in the form of 
a regression equation. Thus in the above example the coeffi- 
cients .582 and .612 measure the reliability of estimates from 
pooling two and three predictors in the best linear form. 
The gain in reliability is very slight, however, when a third 
variable is added, a conclusion which was reached also by com- 
paring the probable errors of estimate, 19.5 and 19.0. 

An interesting application of the method of multiple correla- 
tion is given in the volume on Psychological Testing in the 
United States Army,* where the possibility of increasing the 
correlation between the Beta scale and the Stanford-Binet 
test is determined. The necessary zero-order coefficients are 
given in the following table. 


TABLE 82. CORRELATIONS OF BETA TESTS WITH STANFORD-BINET
MENTAL AGE AND WITH EACH OTHER (653 CASES)


                                        BETA TESTS
TEST                        1      2      3      4      5      6      7      8

Stanford-Binet            .465   .545   .614   .639   .622   .586   .610   .572
Beta Tests
1. Maze                          .477   .522   .514   .457   .490   .510   .476
2. Cube analysis                        .682   .576   .560   .556   .592   .551
3. X-O series                                  .689   .670   .584   .597   .619
4. Digit-symbol                                       .766   .654   .584   .695
5. Number check                                              .619   [illegible]   [illegible]
6. Picture completion                                               .555   .569
7. Geometrical                                                             .559
8. Spot pattern


* Memoirs of the National Academy of Sciences, Vol. XV (1921), p. 387. 
Upon applying the multiple-correlation formula (155) it was
found that $R_{S(12345678)} = .731$, which is the highest
correlation obtainable between Stanford-Binet and the best linear
weighting of the eight Beta tests. The correlation between the
unweighted pool of these eight tests and the Binet test was .728,
showing that very slight improvement is made by weighting
such components.

The writers of the above report then decided to eliminate 
certain tests as suggested by the results from the partial cor- 
relations and thus obtain a shorter and possibly as good a test 
with unweighted items as with the whole battery weighted or 
unweighted. By empirical trial they found: (1) elimination of 
test 8, r(Stanford x Beta) = .726; (2) elimination of tests 8 
and 2, r(Stanford x Beta) = .728; (3) elimination of tests 8, 
2, and 1, r(Stanford x Beta) = .723. Thus the simple pool of 
five of the Beta tests gave almost as good results as the best 
weighting of all eight. 

The final form suggested was to use a non-weighted pool of 
six of the Beta tests, dropping test 8 and giving test 1 one half 
the weight of the rest. The correlation for this last result with 
the Binet scale was .727, which is only slightly less than the 
best value, .731. 

Some important properties of multiple correlation may next 
be shown by returning to equation (156). It is apparent that 
every parenthesis on the right is smaller than unity, provided 
none of the partial correlations be equal to zero. Hence 


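$$1 - R_{1(23\ldots n)}^2 < 1 - r_{12}^2,$$

$$1 - R_{1(23\ldots n)}^2 < 1 - r_{13.2}^2,$$

and $$1 - R_{1(23\ldots n)}^2 < 1 - r_{14.23}^2, \text{ etc.}$$

Similarly, $$1 - R_{1(23\ldots n)}^2 < 1 - r_{13}^2,$$

and $$1 - R_{1(23\ldots n)}^2 < 1 - r_{14}^2, \text{ etc.}$$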


The multiple-correlation coefficient R cannot, therefore, be 
smaller than any partial coefficient of zero or of a higher order, 
and it is usually considerably larger. 
If the coefficients $r_{12}, r_{13}, \ldots, r_{1n}$ are all equal and are denoted
by $r_{cx}$, and if the coefficients $r_{23}, r_{24}, \ldots, r_{(n-1)n}$ are also equal
and are denoted by $r_{xx}$, it also follows from (155) that

$$R_{1(23\ldots n)} = r_{cx} \sqrt{\frac{n - 1}{1 + (n - 2)\, r_{xx}}}. \tag{157}$$

{Multiple-correlation coefficient for equal coefficients}

In case $C$ is to be predicted from $n$ other variables, $n - 1$ may
be replaced by $n$ in formula (157), giving

$$R_{c(123\ldots n)} = r_{cx} \sqrt{\frac{n}{1 + (n - 1)\, r_{xx}}}, \tag{158}$$

which is the same result as that obtained in equation (51) of
Chapter IX.

If the numerator and denominator under the radical of
equation (158) be divided by $n$, and then $n$ be allowed to approach
infinity, we find that

$$R_{c(123\ldots\infty)} = \frac{r_{cx}}{\sqrt{r_{xx}}}. \tag{159}$$

{Limiting value for (158) when $n \to \infty$}


These last two equations are useful in estimating the limits
for prediction. Suppose, for example, that there are 50
unrelated environmental conditions, each correlated to the extent of
.05 with human physical traits ($r_{xx} = 0$, and $r_{cx} = .05$). Upon
substituting in (158), we find $R = .05\sqrt{50} = .35$. In actual
practice, however, there is a correlation of about .5 between such
environmental conditions, so that by using (159) we find $R = .07$;
that is, an infinity of such conditions increases the correlation
from .05 to only .07.

The best results are of course obtained by seeking predictors
which correlate high with the criterion and low amongst
themselves. Thus, if $r_{cx} = .6$, $r_{xx} = .4$, and $n = 10$, we find, from
equation (158), that $R = .88$. Arbitrary values may be
substituted in this formula, giving a result greater than unity; but,
from the constitution of the whole set of variables, this
cannot occur in actual practice.
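
The three numerical cases just quoted follow directly from formulas (158) and (159). The following Python sketch is an illustration, not part of the original text; the function names are hypothetical, and the last call uses arbitrary values chosen only to show how an impossible result greater than unity can arise.

```python
from math import sqrt

def r_equal(r_cx, r_xx, n):
    """Formula (158): R for n predictors, each correlating r_cx with the
    criterion and r_xx with every other predictor."""
    return r_cx * sqrt(n / (1 + (n - 1) * r_xx))

def r_limit(r_cx, r_xx):
    """Formula (159): limiting value of (158) as n grows without bound."""
    return r_cx / sqrt(r_xx)

unrelated = r_equal(.05, 0.0, 50)     # 50 uncorrelated conditions
correlated_limit = r_limit(.05, .5)   # infinitely many, intercorrelated .5
good_battery = r_equal(.6, .4, 10)    # high validity, moderate intercorrelation
impossible = r_equal(.9, .1, 10)      # arbitrary values, not attainable in practice

print(f"{unrelated:.2f} {correlated_limit:.2f} {good_battery:.2f} {impossible:.2f}")
```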
7. SOLUTION BY DETERMINANTS 


While the methods of calculation shown thus far are probably 
as convenient as any up to four variables, another procedure will 
next be given in which determinants are employed. The student 
who is familiar with the theory of determinants and who has the 
use of a calculating machine may find this method fairly rapid. 

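The chief function used is the determinant of all the zero-order
correlations given by

$$\Delta = \begin{vmatrix} r_{11} & r_{12} & r_{13} & \cdots & r_{1n} \\ r_{21} & r_{22} & r_{23} & \cdots & r_{2n} \\ r_{31} & r_{32} & r_{33} & \cdots & r_{3n} \\ \cdots & \cdots & \cdots & \cdots & \cdots \\ r_{n1} & r_{n2} & r_{n3} & \cdots & r_{nn} \end{vmatrix} \tag{160}$$

{Determinant of zero-order coefficients}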


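A minor such as $\Delta_{12}$ is obtained by striking out all the
coefficients in the row and column common to $r_{12}$. A cofactor, $A_{ij}$, is
equal to the minor $\Delta_{ij}$ with the sign that would be attached in
expanding the determinant. Thus the three-rowed determinant

$$\Delta = \begin{vmatrix} r_{11} & r_{21} & r_{31} \\ r_{12} & r_{22} & r_{32} \\ r_{13} & r_{23} & r_{33} \end{vmatrix} \tag{161}$$

{Determinant for three variables}

may be written $\Delta = r_{11}\Delta_{11} - r_{12}\Delta_{12} + r_{13}\Delta_{13}$

or

$$\Delta = r_{11}A_{11} + r_{12}A_{12} + r_{13}A_{13}$$

$$= r_{11}\begin{vmatrix} r_{22} & r_{32} \\ r_{23} & r_{33} \end{vmatrix} - r_{12}\begin{vmatrix} r_{21} & r_{31} \\ r_{23} & r_{33} \end{vmatrix} + r_{13}\begin{vmatrix} r_{21} & r_{31} \\ r_{22} & r_{32} \end{vmatrix}$$

$$= r_{11}(r_{22}r_{33} - r_{23}^2) + r_{12}(r_{13}r_{23} - r_{12}r_{33}) + r_{13}(r_{12}r_{23} - r_{13}r_{22}).$$

Simplifying this last expression, we find that

$$\Delta = 1 - r_{12}^2 - r_{13}^2 - r_{23}^2 + 2 r_{12} r_{13} r_{23}, \tag{162}$$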


which, of course, is the same as S123 of section 5. 
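
The identity between the three-rowed determinant and $S_{123}$ is easily confirmed numerically. The following Python sketch is an illustration, not part of the original text; `det` is a hypothetical helper that expands by the first row, and the correlations are those of Table 80.

```python
def minor_matrix(m, i, j):
    """Matrix left after striking out row i and column j (0-indexed)."""
    return [row[:j] + row[j + 1:] for k, row in enumerate(m) if k != i]

def det(m):
    """Determinant by cofactor expansion along the first row."""
    if len(m) == 1:
        return m[0][0]
    return sum((-1) ** j * m[0][j] * det(minor_matrix(m, 0, j))
               for j in range(len(m)))

r12, r13, r23 = .54, .49, .58
delta = det([[1, r12, r13],
             [r12, 1, r23],
             [r13, r23, 1]])
s123 = 1 - r12 ** 2 - r13 ** 2 - r23 ** 2 + 2 * r12 * r13 * r23  # formula (162)
print(f"{delta:.3f} {s123:.3f}")
```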
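With a similar notation Professor Pearson* has shown that

$$r_{12.34\ldots n} = \frac{-A_{12}}{\sqrt{A_{11}A_{22}}}, \tag{163}$$

$$r_{1k.23\ldots n} = \frac{-A_{1k}}{\sqrt{A_{11}A_{kk}}}, \tag{164}$$

$$R_{1(23\ldots n)} = \sqrt{1 - \frac{\Delta}{A_{11}}}, \tag{165}$$

$$\sigma_{1.23\ldots n} = \sigma_1 \sqrt{\frac{\Delta}{A_{11}}}, \tag{166}$$

$$b_{12.34\ldots n} = \frac{-A_{12}}{A_{11}} \cdot \frac{\sigma_1}{\sigma_2}, \tag{167}$$

$$b_{1k.23\ldots n} = \frac{-A_{1k}}{A_{11}} \cdot \frac{\sigma_1}{\sigma_k}. \tag{168}$$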


These formulas will next be illustrated by the problem in
predicting geometrical success. Arranging the zero-order
coefficients from Table 80 in the form of a determinant, we have

$$\Delta = \begin{vmatrix} 1 & .54 & .49 & .41 \\ .54 & 1 & .58 & .29 \\ .49 & .58 & 1 & .50 \\ .41 & .29 & .50 & 1 \end{vmatrix}.$$

This may be worked out by reducing to a determinant of lower
order.

Multiplying each row by the reciprocal of the item in the
first column, we have

RECIPROCAL OF
COLUMN 1
   1          | 1    0.54    0.49    0.41  |  × 1.000
   1.852      | 1    1.852   1.074   0.537 |  × .54
   2.041      | 1    1.184   2.041   1.020 |  × .49
   2.439      | 1    0.707   1.220   2.439 |  × .41

(If all the elements of a row (or column) are multiplied by the
same number $n$, the determinant is multiplied by $n$.)


* Unpublished lecture notes. 


Next, subtract the elements of the first row from those of each
of the other three rows (this leaves the value of $\Delta$ unchanged).

$$\Delta = \begin{vmatrix} 1 & 0.54 & 0.49 & 0.41 \\ 0 & 1.312 & 0.584 & 0.127 \\ 0 & 0.644 & 1.551 & 0.610 \\ 0 & 0.167 & 0.730 & 2.029 \end{vmatrix} \times .1085 = \begin{vmatrix} 1.312 & 0.584 & 0.127 \\ 0.644 & 1.551 & 0.610 \\ 0.167 & 0.730 & 2.029 \end{vmatrix} \times .1085$$

$$= .1085\,[\,1.312(1.551 \times 2.029 - .61 \times .73) - .644(.584 \times 2.029 - .127 \times .73) + .167(.584 \times .61 - .127 \times 1.551)\,] = .3112.$$


The determinant can of course be reduced to two rows before 
expanding, but the arithmetic from the three-rowed value above 
is very rapid on a machine. 

The other determinants required may be worked out in a 
similar way, that is, 


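$$\Delta_{11} = \begin{vmatrix} 1 & .58 & .29 \\ .58 & 1 & .50 \\ .29 & .50 & 1 \end{vmatrix} = +.498, \qquad \Delta_{12} = \begin{vmatrix} .54 & .49 & .41 \\ .58 & 1 & .50 \\ .29 & .50 & 1 \end{vmatrix} = +.192.$$

Also, $\Delta_{22} = +.543$, $\Delta_{33} = +.585$, $\Delta_{44} = +.439$, $\Delta_{13} = -.0779$,
and $\Delta_{14} = +.1095$, so that $A_{12} = -.192$, $A_{13} = -.0779$, and
$A_{14} = -.1095$.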

Substituting these values in formulas (163) to (168), we find

$$r_{12.34} = \frac{.192}{\sqrt{.498 \times .543}} = .369,$$

$$r_{13.24} = \frac{.0779}{\sqrt{.498 \times .585}} = .144,$$

$$r_{14.23} = \frac{.1095}{\sqrt{.498 \times .439}} = .234,$$

$$R_{1(234)} = \sqrt{1 - \frac{.3112}{.498}} = .612,$$

and

$$\sigma_{1.234} = 35.5 \sqrt{\frac{.3112}{.498}} = 28.1.$$

$$\therefore \text{P.E.}_{1.234} = .6745\, \sigma_{1.234} = 19.0.$$

Also,

$$b_{12.34} = \frac{.192}{.498} \times \frac{35.5}{6.87} = 1.99,$$

$$b_{13.24} = \frac{.0779}{.498} \times \frac{35.5}{21.28} = .261,$$

and

$$b_{14.23} = \frac{.1095}{.498} \times \frac{35.5}{8.49} = .919.$$


It should be noted that the above results agree with those
found in section 5, since $\Delta_{12} = S_{12.34}$, $\Delta_{13} = -S_{13.24}$, $\Delta_{14} = S_{14.23}$,
$\Delta_{11} = S_{234}$, $\Delta_{22} = S_{134}$, $\Delta_{33} = S_{124}$, and $\Delta_{44} = S_{123}$.
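
The whole determinant solution may be sketched in a few lines for a machine. The Python below is an illustration, not part of the original text; the helper names are hypothetical, and the determinant is expanded by cofactors rather than by the row-reduction shown above. Since it carries unrounded values, its figures may differ from the printed ones in the last place (for example, $\Delta$ comes out near .3111 rather than .3112).

```python
from math import sqrt

def minor_matrix(m, i, j):
    """Matrix left after striking out row i and column j (0-indexed)."""
    return [row[:j] + row[j + 1:] for k, row in enumerate(m) if k != i]

def det(m):
    """Determinant by cofactor expansion along the first row."""
    if len(m) == 1:
        return m[0][0]
    return sum((-1) ** j * m[0][j] * det(minor_matrix(m, 0, j))
               for j in range(len(m)))

def cofactor(m, i, j):
    """Signed minor A_ij."""
    return (-1) ** (i + j) * det(minor_matrix(m, i, j))

# Zero-order correlations and standard deviations from Table 80.
corr = [[1, .54, .49, .41],
        [.54, 1, .58, .29],
        [.49, .58, 1, .50],
        [.41, .29, .50, 1]]
sd = [35.5, 6.87, 21.28, 8.49]

delta = det(corr)
A = [[cofactor(corr, i, j) for j in range(4)] for i in range(4)]

# Partial correlations r12.34, r13.24, r14.23 (index 0 is variable 1).
partials = [-A[0][k] / sqrt(A[0][0] * A[k][k]) for k in (1, 2, 3)]
big_r = sqrt(1 - delta / A[0][0])            # formula (165)
sigma = sd[0] * sqrt(delta / A[0][0])        # formula (166)
# Regression coefficients b12.34, b13.24, b14.23.
b = [-(sd[0] / sd[k]) * A[0][k] / A[0][0] for k in (1, 2, 3)]

print([round(p, 3) for p in partials], round(big_r, 3), round(sigma, 1))
```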

When more than four variables are involved, it is probably 
best to use reduction formulas of the type 


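$$r_{12.345} = \frac{r_{12.34} - r_{15.34}\, r_{25.34}}{\sqrt{(1 - r_{15.34}^2)(1 - r_{25.34}^2)}} \tag{169}$$

{Partial correlation of third-order}

and

$$b_{12.345} = \frac{r_{12.34} - r_{15.34}\, r_{25.34}}{1 - r_{25.34}^2} \cdot \frac{\sigma_1 \sqrt{1 - r_{13}^2}\, \sqrt{1 - r_{14.3}^2}}{\sigma_2 \sqrt{1 - r_{23}^2}\, \sqrt{1 - r_{24.3}^2}} \tag{170}$$

{Regression coefficient of third-order}
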


and carry out the arithmetic on a calculating machine with the
aid of Miner’s Tables. The computation is not only easier than
by determinants, but the checks $r_{12.34} = r_{12.43}$ etc. already
noted can be conveniently made. A good example of a
correlation problem in five variables is given in Pearl’s “Medical
Statistics and Biometry,” p. 329, while other methods of
calculation may be found in Kelley’s “Statistical Method,”
chap. xi.
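
Reduction formulas of the type (169) apply the same step repeatedly, which suits a recursive routine. The following Python sketch is an illustration, not part of the original text; the function name `partial` is hypothetical. No five-variable data appear in this section, so the routine is exercised on the four variables of Table 80; a fifth variable would only require further entries in `R`.

```python
from math import sqrt

# Zero-order correlations of Table 80, keyed by variable pair.
R = {(1, 2): .54, (1, 3): .49, (1, 4): .41,
     (2, 3): .58, (2, 4): .29, (3, 4): .50}

def partial(i, j, given=()):
    """Partial correlation r_ij.given, removing one secondary subscript at a
    time by the reduction formula
        r_ij.Sk = (r_ij.S - r_ik.S r_jk.S) / sqrt((1 - r_ik.S^2)(1 - r_jk.S^2))."""
    if not given:
        return R[(min(i, j), max(i, j))]
    *rest, k = given
    rest = tuple(rest)
    r_ij = partial(i, j, rest)
    r_ik = partial(i, k, rest)
    r_jk = partial(j, k, rest)
    return (r_ij - r_ik * r_jk) / sqrt((1 - r_ik ** 2) * (1 - r_jk ** 2))

r12_34 = partial(1, 2, (3, 4))
r12_43 = partial(1, 2, (4, 3))   # the check r12.34 = r12.43
r13_2 = partial(1, 3, (2,))
r14_23 = partial(1, 4, (2, 3))
print(f"{r12_34:.3f} {r12_43:.3f} {r13_2:.3f} {r14_23:.3f}")
```

The agreement of `r12_34` and `r12_43` is the check mentioned in the text: the result must not depend on the order in which the secondary variables are eliminated.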


EXERCISES 


1. Data: 113 pupils (67 boys and 46 girls). Variables, (1) age,
(2) weight, (3) standing height, (4) sitting height. Correlations,
$r_{12}$ = [illegible], $r_{13} = .85$, $r_{14}$ = [illegible], $r_{23} = .89$, $r_{24} = .90$, $r_{34} = .94$.

Work out the partial correlations of the second-order.

($r_{12.34} = -.007$, $r_{13.24} = .00$, $r_{14.23} = -.04$,
$r_{23.14} = .26$, $r_{24.13}$ = [illegible], $r_{34.12} = .63$. Ans.)
2. Calculate the first-order and second-order partial-correlation
coefficients from the following data:
$r_{12} = .78$, $r_{13}$ = [illegible], $r_{14} = .40$, $r_{23} = .48$, $r_{24}$ = [illegible], $r_{34}$ = [illegible].


712.3 = .720 723.1 = pee = 
=O pn Zi) 
es v7 ptt ee 403 123.14 
713.2 = 138 =it 124.1 = —.- ae, 7 AG Ane 
rise = 309 p74 —" Sy 054 124.13 
Thine = Han 56 1345 — bee fee 44 
hie 18 tT . pts AbA 34.12 
3. Given: T= .481, io SEL 4S 494, 
fog = + Olas 24 — Bes) 2 .286, 
— 34.48, 02 = 2.89, 03 = A Sis. 04 = ea (S). 


M, = 99.94, M2 = 73.64, M3 = 78.238, Ma — Tia: 


Verify the following results: 
X, = 1093 — 2.09 X2 — 7.40 X3 — 3.36 Xa, 
01.234 = 21.69, Ri (234) = aise. 


4. The following regression equation was obtained by F. L.
Whitney (Journal of Educational Research, May, 1923):

$$X_1 = 23.218 + .004 X_2 - .088 X_3 - .115 X_4 + .915 X_5 + 1.403 X_6 - .085 X_7 \pm 3.02.$$

Predict the teaching success of a student with the following records:

X1 (to be predicted) = score on a rating scale,
X2 = 80 = intelligence score,
X3 = 89.4 = high-school academic record,
X4 = 8.7 = normal-school academic record,
X5 = 8.5 = normal-school professional record,
X6 = 8.6 = student-teaching record,
X7 = 9.0 = measure of physique.

(X1 = 38.24 ± 3.02. Ans.)

Interpret the regression coefficients. Do good academic work and
good physique interfere with good teaching?


5. Derive the formula $R_{1(23)}$ = [illegible].

6. Derive formulas (151a), (151b), and (151c).

7. Derive formulas (157), (158), and (159).


CHAPTER XVI 
THE ELEMENTS OF CURVE-FITTING 
1. INTRODUCTORY 


The investigator in many fields of science is frequently inter- 
ested to determine the mathematical curve underlying his data. 
Such a curve is not only desirable in furnishing the theoretical 
law to which the observations conform, but is also of practical 
value as a basis for estimation. In the fields of education and 
psychology examples are furnished by learning curves, physical 
and mental growth curves, and frequency distributions. It is 
important to know the general laws of mental growth as well 
as to predict the standing of individuals of a given group, and 
for such purposes it is usually necessary to fit the data with a 
curve whose constants depend upon the observations. The 
plot of the experimental data often suggests some mathemati- 
cal function which will be a good approximation to the observed 
material, allowing for the minor fluctuations in sampling. The 
problem is then to select the type of curve which is to be fitted 
to the data and to obtain its equation by appropriate methods. 
The suitability of the curve selected may finally be determined 
by tests for goodness of fit. 

The choice of the proper sort of mathematical function will 
depend a great deal upon the worker’s experience in curve- 
fitting and the accuracy of fit required. It is a well-known 
fact that by putting as many constants into the equation as 
there are observations the resulting curve will pass through all 
the observed points. If this is done, however, an extremely 
complicated function will result and the minor fluctuations, 
which should be smoothed out, will be given undue emphasis. 


It is therefore better to use a simple function involving only a 
few constants, securing in this way a smoothing or graduation 
of the data which allows for the small fluctuations of sampling. 

In the present chapter we shall introduce several of these 
simple curves and show how they may be fitted to the observed 
data. The observations to be fitted may consist of a series of 
points resulting from two measured characters such as the 
amount learned in a given time, or they may be given in the 
form of a frequency distribution. Three types of curves will be 
presented for fitting data of the first sort, while the normal 
probability curve will be used to illustrate the method of grad- 
uating frequency distributions. It should be noted that these 
curves have been selected from a very large number available 
because they have been found to give good results with certain 
data. They are presented here chiefly for illustration of the 
methods of fitting. 


2. TYPES OF CURVES 


In dealing with growth and learning data one of the most
useful functions is the hyperbola, which for the purpose of
curve-fitting may be most conveniently written in the form

$$Y = \frac{X}{a + bX} + c. \tag{171}$$

{Hyperbola}

The constants $a$, $b$, and $c$ are to be determined from the
observations. The use of this curve will be illustrated in applying
the method of averages in section 5.

Another curve which has been found to give a good
approximation to growth data is the logarithmic growth function,

$$Y = a + bX + c \log X. \tag{172}$$

{Logarithmic growth function}

This curve is similar in appearance to (171) and will be shown
to give approximately as good results with certain data. The
introduction of the terms $a + bX$ has the effect of raising and
stretching out horizontally the ordinary logarithmic curve,
$Y = c \log X$.
A third and very useful function is the nth-order parabola,

$$Y = C_0 + C_1 X + C_2 X^2 + C_3 X^3 + \cdots + C_n X^n, \tag{173}$$

{nth-order parabola}

where the $C$’s are constants determined by the data. If
$C_2 = C_3 = \cdots = C_n = 0$, this expression reduces to the equation of
a straight line; if $C_3 = C_4 = \cdots = C_n = 0$, an ordinary parabola
results; while if $C_4 = C_5 = \cdots = C_n = 0$, a cubic is obtained, etc.
In the case of regression curves from correlation tables, a very
good fit is often obtained by the use of the nth-order parabola,
but the question of how many terms to include must frequently
be decided by trial and error.

A full discussion of frequency curves is beyond the scope of
the present text. We shall therefore confine our illustration
in the last section to the normal curve $y = y_0 e^{-x^2/2\sigma^2}$, which is
already familiar to the reader.


3. METHODS OF CURVE-FITTING 


The first step in anticipation of curve-fitting is to plot the 
observed data so as to note the trend of the points and to deter- 
mine, if possible, the appropriate curve to use. Having chosen 
some simple form such as described above, it is next necessary 
to determine the approximate values of the constants appear- 
ing in the equation. The methods used for such determina- 
tion will depend upon the degree of accuracy required in the 
fit. If only a rough idea of the trend is required, a free-hand 
curve drawn through the observed points may be sufficient. For 
more accurate results, however, it will be necessary to apply 
certain mathematical methods known as averages, least squares, 
or moments. The first three of these methods will next be 
described, while the method of moments will be treated in 
sections 8 and 9 in dealing with frequency data. 
Free-hand Method 


A free-hand curve drawn through the observed points is 
clearly the easiest and simplest method to employ, but as al- 
ready noted it may give results which are quite inaccurate. 
Several workers, moreover, would not agree closely upon the 
same free-hand graduation. 

In drawing a curve through a series of points the fitting is 
often facilitated by the use of curved pieces of celluloid (French 
curves). These may be moved about as the curve is drawn so 
that the largest possible number of observed points lie on the 
curve or deviate equally on either side. 

It sometimes happens that the most elaborate mathematical 
methods fail to give a good fit with certain data over a part of 
the range. In such cases it may be desirable to resort to free- 
hand approximations, possibly in combination with the other 
methods.* 


Method of Averages 


A second and more accurate method of curve-fitting is the
method of averages. If $Y$ represents an observed ordinate and
$\bar{Y}$ denotes an ordinate on the fitted curve, the vertical
deviations $Y - \bar{Y}$ are known as residuals (see Chapter IX). It is
assumed in the method of averages that the “best” fit is
that which makes the algebraic sum of the residuals equal
to zero.

In the case of a straight line $\bar{Y} = a + bX$ the above condition
requires that

$$\Sigma(Y - \bar{Y}) = \Sigma(Y - a - bX) = 0,$$

or $$\Sigma Y - Na - b\,\Sigma X = 0. \tag{174}$$


By dividing data into two parts, two equations of this type may 
be formed and solved for the constants a and b. 


* For an example of this sort see an article by the author, “On the Relation of
Vital Capacity to Certain Psychical Characters,” Biometrika, Vol. XVI, p. 140.

While the method of averages may be used with functions 
involving several constants, it will be found most convenient 
when applied to the straight line where only two constants are 
required. This method will be illustrated in detail in section 5. 


Method of Least Squares 


The third and one of the best methods of curve-fitting is 
known as the method of least squares. This procedure, it will 
be recalled, has already been used in Chapter IX in obtaining 
the equations of the regression lines. Further illustration will 
now be given for the nth-order parabola. 

If $\bar{Y}$ represents a value on such a parabola and $Y$ an observation,
the problem is to find values of the constants $C_0, C_1, C_2, \cdots, C_n$
such that the sum of the squares of the residuals $(Y - \bar{Y})$
is as small as possible, that is, to have

$$u = \sum (Y - C_0 - C_1X - C_2X^2 - \cdots - C_nX^n)^2 = \text{a minimum}.$$


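This is accomplished by equating to zero the partial derivatives*
of $u$ with respect to $C_0, C_1, C_2, \cdots, C_n$, and thereby obtaining
$n + 1$ equations for the solution of the $n + 1$ constants.

Differentiating in this way and setting the obtained results
equal to zero, we find

$$\frac{\partial u}{\partial C_0} = 2\sum (Y - C_0 - C_1X - C_2X^2 - \cdots - C_nX^n)(-1) = 0,$$

$$\frac{\partial u}{\partial C_1} = 2\sum (Y - C_0 - C_1X - C_2X^2 - \cdots - C_nX^n)(-X) = 0,$$

$$\frac{\partial u}{\partial C_2} = 2\sum (Y - C_0 - C_1X - C_2X^2 - \cdots - C_nX^n)(-X^2) = 0,$$

$$\cdots$$

$$\frac{\partial u}{\partial C_n} = 2\sum (Y - C_0 - C_1X - C_2X^2 - \cdots - C_nX^n)(-X^n) = 0,$$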

or, rearranging,

$$\sum Y = C_0\sum(1) + C_1\sum(X) + C_2\sum(X^2) + \cdots + C_n\sum(X^n), \tag{175a}$$

$$\sum XY = C_0\sum(X) + C_1\sum(X^2) + C_2\sum(X^3) + \cdots + C_n\sum(X^{n+1}), \tag{175b}$$

$$\sum X^2Y = C_0\sum(X^2) + C_1\sum(X^3) + C_2\sum(X^4) + \cdots + C_n\sum(X^{n+2}), \tag{175c}$$

$$\cdots$$

$$\sum X^nY = C_0\sum(X^n) + C_1\sum(X^{n+1}) + C_2\sum(X^{n+2}) + \cdots + C_n\sum(X^{2n}), \tag{175d}$$

{Normal equations, unweighted ordinates}

where $\sum(1)$ = the number of ordinates summed.

* See any good calculus, such as W. A. Granville, Differential and Integral Calculus.
Ginn and Company.

These last expressions are known as the normal equations, of
which there are clearly $n + 1$. The variable $Y$ is given
by the observed ordinates taken from a convenient origin, while
$X$ may also be measured as the deviation $\cdots, -3, -2, -1,
0, 1, 2, 3, \cdots$ from any arbitrary point. The quantities $\sum Y$,
$\sum XY$, $\sum X^2Y$, etc., are found in the manner illustrated in
section 7, and the $n + 1$ resulting linear equations are then solved
for $C_0, C_1, C_2, \cdots, C_n$.
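
The formation and solution of the normal equations (175) can be sketched in Python as follows. This is only an illustrative outline: the helper names `solve_linear` and `fit_polynomial` are invented for the example, and Gaussian elimination with pivoting stands in for the hand elimination used later in the chapter.

```python
def solve_linear(matrix, rhs):
    """Solve a small linear system by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [list(map(float, row)) + [float(r)] for row, r in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if aug[pivot][col] == 0:
            raise ValueError("normal equations are singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, n):
            factor = aug[row][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[row][k] -= factor * aug[col][k]
    coeffs = [0.0] * n
    for row in range(n - 1, -1, -1):     # back-substitution
        tail = sum(aug[row][k] * coeffs[k] for k in range(row + 1, n))
        coeffs[row] = (aug[row][n] - tail) / aug[row][row]
    return coeffs


def fit_polynomial(xs, ys, degree):
    """Return [C0, C1, ..., Cn] from the unweighted normal equations (175)."""
    # Sums of X^0 ... X^(2n) supply every coefficient on the right of (175).
    power_sums = [sum(x ** p for x in xs) for p in range(2 * degree + 1)]
    # Sums of Y, XY, ..., X^n Y supply the left-hand sides.
    moment_sums = [sum((x ** p) * y for x, y in zip(xs, ys))
                   for p in range(degree + 1)]
    matrix = [[power_sums[i + j] for j in range(degree + 1)]
              for i in range(degree + 1)]
    return solve_linear(matrix, moment_sums)


# X measured as deviations -3 ... 3 from an arbitrary point, as the text suggests.
xs = [-3, -2, -1, 0, 1, 2, 3]
ys = [1 + 2 * x - 0.5 * x ** 2 for x in xs]
print(fit_polynomial(xs, ys, 2))
```

Measuring X as symmetric deviations makes the sums of odd powers vanish, which is why the hand computation in section 7 separates into two smaller systems.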

In the above case it has been assumed that the ordinates $Y$
are to be given equal weight. If these are obtained from the
means of arrays, however, the frequencies $f_x$ may need to be
taken into account. It is then necessary to make the sum

$$\sum f_x (Y - C_0 - C_1X - C_2X^2 - \cdots - C_nX^n)^2 = \text{a minimum},$$

giving rise to the following normal equations:

$$\sum f_xY = C_0\sum(f_x) + C_1\sum(f_xX) + C_2\sum(f_xX^2) + \cdots + C_n\sum(f_xX^n), \tag{176a}$$

$$\sum f_xXY = C_0\sum(f_xX) + C_1\sum(f_xX^2) + C_2\sum(f_xX^3) + \cdots + C_n\sum(f_xX^{n+1}), \tag{176b}$$

$$\cdots$$

$$\sum f_xX^nY = C_0\sum(f_xX^n) + C_1\sum(f_xX^{n+1}) + C_2\sum(f_xX^{n+2}) + \cdots + C_n\sum(f_xX^{2n}). \tag{176c}$$

{Normal equations, weighted ordinates}


To distinguish these two methods the former is said to be
based on unweighted, and the latter on weighted, ordinates. Both
methods will be fully illustrated in section 7.

In comparing the fit of two or more curves to a given body
of data, a good test is furnished by finding the sum, or the mean,
of the squared residuals, that is,

$$\sum (Y - \bar{Y})^2 \quad \text{or} \quad \frac{\sum (Y - \bar{Y})^2}{N}.$$


4. ILLUSTRATION OF THE FREE-HAND METHOD


As an example of the free-hand method, we have selected a 
series of seven observations given at the right of Fig. 70. The 
curve was so drawn as to let the points deviate about equally 
on either side. 


  Y  | X
   5 | 1
  10 | 2
  22 | 3
  35 | 4
  55 | 5
  73 | 6
 101 | 7

Fig. 70. A free-hand curve drawn through seven observed points


If an approximation to the equation of such a curve is 
desired, it may often be found by rectification. Thus if the 
desired equation has the form 


f(Y) = a+ bF(X),* (177) 


we may rectify this equation by substituting $Y' = f(Y)$ and
$X' = F(X)$. The result,

$$Y' = a + bX', \tag{178}$$

will then be a straight line by means of which the constants
a and b can be determined. The form of the original function
(177) must, of course, be guessed, but if a straight line results
from (178) the choice is justified.

* The symbols f(Y) and F(X) mean a function of Y and a function of X. See
Chapter III, section 5.

  Y  | X^2
   5 |  1
  10 |  4
  22 |  9
  35 | 16
  55 | 25
  73 | 36
 101 | 49

Fig. 71. Illustrating the method of rectification

In the above problem it looks as if the desired equation might
be a parabola of the form

$$Y = a + bX^2. \tag{179}$$

Setting $Y' = Y$ and $X' = X^2$, we may then plot
$Y$ against $X^2$ to see if a straight line is obtained.

The graph in Fig. 71 clearly justifies the choice of the parabola,
so that it only remains to obtain the constants a and b.
Since the first and last points appear to fall on the line, we
may obtain approximate values for these quantities by solving
the resulting equations, $5 = a + b$ and $101 = a + 49b$, giving
$a = 3$ and $b = 2$. The equation of the parabola is then

$$Y = 3 + 2X^2.$$

The method of rectification is useful not only in justifying
the form of the function assumed but in furnishing a simple
method of obtaining the necessary constants. 
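
The rectification just described can be checked with a few lines of Python. The sketch below uses the seven observations of Fig. 70, substitutes X' = X^2, and passes a line through the first and last rectified points as in the text; the variable names are illustrative only.

```python
ys = [5, 10, 22, 35, 55, 73, 101]   # observed ordinates from Fig. 70
xs = [1, 2, 3, 4, 5, 6, 7]

x_rect = [x ** 2 for x in xs]       # rectifying substitution X' = X^2

# Line Y = a + b*X' through the first and last rectified points.
b = (ys[-1] - ys[0]) / (x_rect[-1] - x_rect[0])
a = ys[0] - b * x_rect[0]

# Residuals of the observations from the parabola Y = a + b*X^2.
residuals = [y - (a + b * xr) for y, xr in zip(ys, x_rect)]
print(a, b, residuals)
```

The small residuals of alternating sign are what one expects when the rectified points scatter about a straight line.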


5. FITTING A LEARNING CURVE WITH A HYPERBOLA BY 
THE METHOD OF AVERAGES 


The data used for the present illustration were interpolated
from a graph* showing the number of words typed in four
minutes, Y, for various numbers of pages written, X. Inspection
of Table 83 and Fig. 73, where the data are plotted, suggests
that a hyperbola might be a good fit, and this is the curve
employed by Thurstone.


TABLE 83. DATA FROM L. L. THURSTONE'S EXPERIMENT IN TYPEWRITING

TOTAL NUMBER OF PAGES WRITTEN | WORDS TYPED IN FOUR MINUTES
250 | 148
230 | 145
210 | 138
190 | 133
170 | 130
150 | 120
130 | 113
110 | 110
 90 |  99
 70 |  90
 50 |  73
 30 |  60
 10 |  39


Inasmuch as the curve does not pass through the origin, it
will be necessary to add a constant term to the equation of the
hyperbola through (0, 0), with the result that

$$Y = \frac{X}{a + bX} + c. \tag{171}$$

The constant $c$ here represents the skill in typewriting which
the students have at the beginning of the experiment.

* L. L. Thurstone, "The Learning Curve Equation," Psychological Review
Monographs, Vol. XXV, No. 8 (1919), p. 45, Fig. 5. (Only the odd ordinates were
used.)

The above hyperbola may be rectified by selecting a point
$(X_k, Y_k)$ on the curve (or one that looks as if it might fall on
the curve) and forming the differences

$$Y - Y_k = \frac{a(X - X_k)}{(a + bX)(a + bX_k)}.$$

The hyperbola passing through the point $X_k, Y_k$ is therefore

$$\frac{X - X_k}{Y - Y_k} = (a + bX_k) + \frac{b}{a}(a + bX_k)X. \tag{180}$$

Setting $\frac{X - X_k}{Y - Y_k} = Z$, $a + bX_k = m$, and $\frac{b}{a}(a + bX_k) = n$, we
may also write

$$Z = m + nX, \tag{181}$$

which is linear in X and Z. Thus if the plot of X and Z
approximates a straight line, the original data will be approximated
by the hyperbola (180), and the equation of the latter
may be obtained from (181) by determining m and n.


TABLE 84. SHOWING THE CALCULATION NECESSARY FOR RECTIFYING
THE HYPERBOLA FITTED TO THE LEARNING-CURVE DATA

PAGES | X  | Y   | X - 1 | Y - 39 | Z = (X - 1)/(Y - 39)
250   | 13 | 148 | 12    | 109    | .1101
230   | 12 | 145 | 11    | 106    | .1038
210   | 11 | 138 | 10    |  99    | .1010
190   | 10 | 133 |  9    |  94    | .0957
170   |  9 | 130 |  8    |  91    | .0879
150   |  8 | 120 |  7    |  81    | .0864
130   |  7 | 113 |  6    |  74    | .0811
110   |  6 | 110 |  5    |  71    | .0704
 90   |  5 |  99 |  4    |  60    | .0667
 70   |  4 |  90 |  3    |  51    | .0588
 50   |  3 |  73 |  2    |  34    | .0588
 30   |  2 |  60 |  1    |  21    | .0476
 10   |  1 |  39 |  0    |   0    | ...
Totals, first six items (X = 13 to 8): sum of X = 63, sum of Z = .5849
Totals, next six items (X = 7 to 2): sum of X = 27, sum of Z = .3834

The calculation for such rectification is shown in Table 84.
The first point in the series with coördinates $X_k = 1$, $Y_k = 39$ has
been selected for the origin. In Fig. 72 the values $Z = \frac{X - 1}{Y - 39}$
have been plotted with a resulting trend that appears to be
fairly linear. It now remains to find the equation of the line
of best average fit.

By dividing the data into two parts and summing over each, as shown
in Table 84, the two equations like (181) necessary for the
determination of m and n may be written

.5849 = 6m + 63n
and .3834 = 6m + 27n.

It will be noted that $\sum X$ is reduced to 27 when only six items are used.
Subtracting the second equation from the first to eliminate m, we find
n = .00560 and, by substitution, m = .0387. The required straight line
which is shown in Fig. 72 then has the equation

$$Z = .0387 + .0056X.$$

Fig. 72. Illustrating the method of rectifying the hyperbola for Thurstone's data

The equation for the hyperbola may now be written

$$\frac{X - 1}{Y - 39} = .0387 + .0056X,$$

or

$$Y = \frac{X - 1}{.0387 + .0056X} + 39. \tag{182}$$
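
The arithmetic of this section can be retraced with a short Python sketch. It rectifies Thurstone's thirteen observations about the point (1, 39), sums Z and X over the two groups of six, and solves the two equations of type (181). Because the sketch carries unrounded values of Z, its constants should agree with the hand computation only to about the precision printed in the text; the variable names are illustrative.

```python
pages_x = list(range(13, 0, -1))    # X = 13, 12, ..., 1
words_y = [148, 145, 138, 133, 130, 120, 113, 110, 99, 90, 73, 60, 39]

xk, yk = 1, 39                      # point chosen as origin for rectification
# Z = (X - Xk)/(Y - Yk); the origin point itself gives 0/0 and is left out.
points = [(x, (x - xk) / (y - yk))
          for x, y in zip(pages_x, words_y) if x != xk]

first, second = points[:6], points[6:]
sx1, sz1 = sum(x for x, _ in first), sum(z for _, z in first)
sx2, sz2 = sum(x for x, _ in second), sum(z for _, z in second)

# Both groups hold six items, so subtracting the equations eliminates m.
n = (sz1 - sz2) / (sx1 - sx2)
m = (sz2 - n * sx2) / len(second)


def hyperbola(x):
    """Ordinate of the fitted hyperbola, in the form of equation (182)."""
    return (x - xk) / (m + n * x) + yk


sse = sum((y - hyperbola(x)) ** 2 for x, y in zip(pages_x, words_y))
print(round(m, 4), round(n, 4), round(sse, 1))
```

The sum of squared residuals printed at the end is the same measure of fit that the text computes from rounded ordinates.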


A list of values for plotting equation (182) is given in Table 85.

TABLE 85. SHOWING THE CALCULATION OF THE ORDINATES FOR THE
HYPERBOLA AND TEST FOR FIT BY SQUARED DIFFERENCES

PAGES | X  | .0056X | .0056X + .0387 | (X - 1)/(.0056X + .0387) | Ȳ = PREVIOUS COLUMN + 39 | Y - Ȳ | (Y - Ȳ)^2
250   | 13 | .0728  | .1115          | 108                      | 147                      |  1    | 1
230   | 12 | .0672  | .1059          | 104                      | 143                      |  2    | 4
210   | 11 | .0616  | .1003          | 100                      | 139                      | -1    | 1
190   | 10 | .0560  | .0947          |  95                      | 134                      | -1    | 1
170   |  9 | .0504  | .0891          |  90                      | 129                      |  1    | 1
150   |  8 | .0448  | .0835          |  84                      | 123                      | -3    | 9
130   |  7 | .0392  | .0779          |  77                      | 116                      | -3    | 9
110   |  6 | .0336  | .0723          |  69                      | 108                      |  2    | 4
 90   |  5 | .0280  | .0667          |  60                      |  99                      |  0    | 0
 70   |  4 | .0224  | .0611          |  49                      |  88                      |  2    | 4
 50   |  3 | .0168  | .0555          |  36                      |  75                      | -2    | 4
 30   |  2 | .0112  | .0499          |  20                      |  59                      |  1    | 1
 10   |  1 | .0056  | .0443          |   0                      |  39                      |  0    | 0
Sum of (Y - Ȳ)^2 = 39

From Fig. 73, where the hyperbola has been plotted, the fit
appears to be a very good one. A numerical measure of fit
is shown in the above table by the quantity

$$\sum (Y - \bar{Y})^2 = 39, \quad \text{or} \quad \frac{\sum (Y - \bar{Y})^2}{N} = \frac{39}{13} = 3.$$

This result will be compared later with that obtained in the case of
a logarithmic growth curve. The size of $\sum (Y - \bar{Y})^2$ will determine
which curve is the better fit for the same observations. The $\chi^2$ test
is not feasible here, owing to difficulties which are beyond the scope of this text.

Fig. 73. Thurstone's data fitted with a hyperbola by the method of averages
(words typed in four minutes plotted against total pages)

6. FITTING A LEARNING CURVE WITH THE LOGARITHMIC
GROWTH FUNCTION BY THE METHOD OF LEAST SQUARES


The data in the preceding section will next be fitted by the 
logarithmic growth curve 


Y=a+bX+c log X, (172) 


using the method of least squares with unweighted ordinates. 
The use of weighted ordinates is usually not necessary, and in 
the above problem the frequencies are not given. 

It is now necessary to find the values of a, b, and c which will
make the quantity

$$v = \sum (Y - a - bX - c\log X)^2 = \text{a minimum}.$$

The partial derivatives of $v$ with respect to a, b, and c are next
formed and equated to zero, as on page 321. The desired
normal equations may then be written in the form

$$\sum(Y) = a\sum(1) + b\sum(X) + c\sum(\log X), \tag{183a}$$

$$\sum(XY) = a\sum(X) + b\sum(X^2) + c\sum(X\log X), \tag{183b}$$

$$\sum(Y\log X) = a\sum(\log X) + b\sum(X\log X) + c\sum(\log X)^2. \tag{183c}$$


{Normal equations for the logarithmic growth curve} 


These may be solved for a, b, and c, giving the constants necessary
for the logarithmic function of least-square fit.

The arithmetic is greatly facilitated by a table for sums such
as $\sum(\log X)$, $\sum(X\log X)$, and $\sum(\log X)^2$, which is given on page
330. For a more extended table of these values the student
should consult Pearl's "Medical Statistics," p. 368.

Upon examining equations (183) it is apparent that the quantities
$\sum(X)$, $\sum(Y)$, $\sum(XY)$, $\sum(X^2)$, and $\sum(Y\log X)$ need to be
calculated from the data, the remaining sums being obtained
from Table 86. The calculation of these required sums is
shown in full in Table 87, where, it will be noted, a check on
$\sum(\log X)$ is obtained.

TABLE 86. SUMS OF LOG X, X LOG X, AND (LOG X)^2

X  | SUM OF LOG X | SUM OF X LOG X | SUM OF (LOG X)^2
 1 |  0.00000     |   0.00000      |  0.00000
 2 |  0.30103     |   0.60206      |  0.09062
 3 |  0.77815     |   2.03342      |  0.31826
 4 |  1.38021     |   4.44166      |  0.68074
 5 |  2.07918     |   7.93651      |  1.16930
 6 |  2.85733     |  12.60542      |  1.77482
 7 |  3.70243     |  18.52111      |  2.48901
 8 |  4.60552     |  25.74583      |  3.30458
 9 |  5.55976     |  34.33401      |  4.21516
10 |  6.55976     |  44.33401      |  5.21516
11 |  7.60116     |  55.78933      |  6.29966
12 |  8.68034     |  68.73950      |  7.46429
13 |  9.79428     |  83.22077      |  8.70516
14 | 10.94041     |  99.26656      | 10.01877
15 | 12.11650     | 116.90793      | 11.40196
16 | 13.32062     | 136.17385      | 12.85187
17 | 14.55107     | 157.09148      | 14.36587
18 | 15.80634     | 179.68639      | 15.94158
19 | 17.08509     | 203.98270      | 17.57679
20 | 18.38612     | 230.00330      | 19.26947
21 | 19.70834     | 257.76991      | 21.01773
22 | 21.05077     | 287.30321      | 22.81983
23 | 22.41249     | 318.62295      | 24.67413
24 | 23.79271     | 351.74802      | 26.57912
25 | 25.19065     | 386.69652      | 28.53335

The calculation of $\sum(X)$ and $\sum(X^2)$ is facilitated by the use
of Pearson's Tables XXVII and XXVIII, which give the sums
and sums of powers of natural numbers.

The normal equations may now be written

(a) 1,398 = 13 a + 91 b + 9.7943 c.
(b) 11,326 = 91 a + 819 b + 83.2208 c.
(c) 1,187.8 = 9.7943 a + 83.2208 b + 8.7052 c.

These may be solved by determinants, but straightforward
elimination is probably as convenient as any method. The
complete solution is given below for the benefit of those students
who have not worked problems of this sort for some time.
Multiplying equation (a) by 7 and subtracting from (b) gives

(d) 1540 = 182 b + 14.6607 c.

TABLE 87. SHOWING THE FORMATION OF SUMS NECESSARY FOR FITTING
A LOGARITHMIC FUNCTION BY UNWEIGHTED ORDINATES

PAGES | X  | Y = WORDS IN 4 MINUTES | XY     | X^2 | LOG X*  | Y LOG X
250   | 13 | 148                    |  1,924 | 169 | 1.11394 | 164.86312
230   | 12 | 145                    |  1,740 | 144 | 1.07918 | 156.48110
210   | 11 | 138                    |  1,518 | 121 | 1.04139 | 143.71182
190   | 10 | 133                    |  1,330 | 100 | 1.00000 | 133.00000
170   |  9 | 130                    |  1,170 |  81 | 0.95424 | 124.05120
150   |  8 | 120                    |    960 |  64 | 0.90309 | 108.37080
130   |  7 | 113                    |    791 |  49 | 0.84510 |  95.49630
110   |  6 | 110                    |    660 |  36 | 0.77815 |  85.59650
 90   |  5 |  99                    |    495 |  25 | 0.69897 |  69.19803
 70   |  4 |  90                    |    360 |  16 | 0.60206 |  54.18540
 50   |  3 |  73                    |    219 |   9 | 0.47712 |  34.82976
 30   |  2 |  60                    |    120 |   4 | 0.30103 |  18.06180
 10   |  1 |  39                    |     39 |   1 | 0.00000 |   0.00000
Total | 91 | 1,398                  | 11,326 | 819 | 9.79427 | 1,187.84583

Multiplying (a) by .75341 and subtracting from (c), we find 
(e) 134.5 = 14.6605 b + 1.3261 c. 


The terms involving b may next be eliminated by multiplying 
(e) by 12.4143 and combining with (d), with the result 


(f) 129.7 = 1.8019 c, or c = 71.98. 


By substitution and check we also obtain a= 34.67 and 
b = 2.663. 
The required growth curve then has the equation 


Y = 34.67 + 2.663 X + 71.98 log X, (184) 


and is plotted in Fig. 74. 
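
As a cross-check on the hand elimination, the normal equations (183) can be formed and solved directly in Python. This is a sketch under stated assumptions: `solve_linear` and `sse` are helper names invented for the example, and full-precision logarithms are used, so the constants obtained may differ somewhat from the rounded values 34.67, 2.663, and 71.98 of equation (184), since the elimination is sensitive to rounding.

```python
import math


def solve_linear(matrix, rhs):
    """Solve a small linear system by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [list(map(float, row)) + [float(r)] for row, r in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, n):
            factor = aug[row][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[row][k] -= factor * aug[col][k]
    out = [0.0] * n
    for row in range(n - 1, -1, -1):
        tail = sum(aug[row][k] * out[k] for k in range(row + 1, n))
        out[row] = (aug[row][n] - tail) / aug[row][row]
    return out


xs = list(range(1, 14))
ys = [39, 60, 73, 90, 99, 110, 113, 120, 130, 133, 138, 145, 148]
logs = [math.log10(x) for x in xs]

# Coefficient matrix and left-hand sums of equations (183a) to (183c).
s_x, s_l = sum(xs), sum(logs)
s_xl = sum(x * l for x, l in zip(xs, logs))
matrix = [
    [len(xs), s_x, s_l],
    [s_x, sum(x * x for x in xs), s_xl],
    [s_l, s_xl, sum(l * l for l in logs)],
]
rhs = [
    sum(ys),
    sum(x * y for x, y in zip(xs, ys)),
    sum(y * l for y, l in zip(ys, logs)),
]
a, b, c = solve_linear(matrix, rhs)


def sse(a, b, c):
    """Sum of squared residuals for Y = a + b*X + c*log X."""
    return sum((y - a - b * x - c * l) ** 2 for x, y, l in zip(xs, ys, logs))


print(round(a, 2), round(b, 3), round(c, 2), round(sse(a, b, c), 2))
```

Whatever small differences appear in the constants, the least-squares solution cannot have a larger sum of squared residuals than the rounded constants of the text.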

From Table 88, where values for plotting are computed, it
will also be noted that the sum of the squared differences,
$\sum (Y - \bar{Y})^2$, is 48, which is not much larger than that obtained
for the hyperbola fitted to the same data. In this example,
then, there is little choice between the two curves.


* These values were read from an ordinary five-place logarithm table.

It should finally be noted that quite different forms of learning
curves result when the time is recorded instead of the
amount learned per unit of practice or time.† The data when
recorded in time units may be converted into amount per unit
of time, as shown in Chapter VI, section 8, and then treated as
illustrated above.

† See Thurstone's Monograph cited above.

Fig. 74. Thurstone's data fitted by a logarithmic growth curve
using the method of least squares (words typed in four minutes
plotted against total pages)

TABLE 88. VALUES FOR PLOTTING Y = 34.67 + 2.663 X + 71.98 LOG X

PAGES | X  | 2.663 X | 71.98 LOG X | ORDINATE, Ȳ | (Y - Ȳ)^2
250   | 13 | 34.619  | 80.181      | 149.5       |  2.25
230   | 12 | 31.956  | 77.679      | 144.3       |   .49
210   | 11 | 29.293  | 74.959      | 138.9       |   .81
190   | 10 | 26.630  | 71.980      | 133.3       |   .09
170   |  9 | 23.967  | 68.686      | 127.3       |  7.29
150   |  8 | 21.304  | 65.004      | 121.0       |  1.00
130   |  7 | 18.641  | 60.830      | 114.1       |  1.21
110   |  6 | 15.978  | 56.011      | 106.7       | 10.89
 90   |  5 | 13.315  | 50.312      |  98.3       |   .49
 70   |  4 | 10.652  | 43.336      |  88.7       |  1.69
 50   |  3 |  7.989  | 34.343      |  77.0       | 16.00
 30   |  2 |  5.326  | 21.668      |  61.7       |  2.89
 10   |  1 |  2.663  | 00.000      |  37.3       |  2.89
Sum of (Y - Ȳ)^2 = 47.99


7. FITTING A GROWTH CURVE WITH A CUBIC BY THE 
METHOD OF LEAST SQUARES 


The following data were obtained from a correlation table of 
age and ossification ratio, the latter being the quotient of the 
ossified wrist-bone area divided by the area of a quadrilateral 
inclosing the carpal bones. The subjects were 520 boys in the 
Laboratory Schools of The University of Chicago. The meas- 
urements were made within a few days of each birthday. 


TABLE 89. DATA FROM LABORATORY SCHOOLS

CENTRAL AGE | FREQUENCY | MEAN OSSIFICATION RATIO
19          |   3       | 1.120
18          |  13       | 1.139
17          |  39       | 1.091
16          |  54       | 1.055
15          |  84       | 1.018
14          |  63       | 0.971
13          |  48       | 0.920
12          |  44       | 0.827
11          |  38       | 0.757
10          |  36       | 0.674
 9          |  30       | 0.570
 8          |  24       | 0.499
 7          |  21       | 0.441
 6          |  15       | 0.360
 5          |   8       | 0.261
Total       | 520       |


We shall fit a cubic to these data, first by considering the 
ordinates of equal weights, and then by weighted ordinates, 
using the observed frequencies as weights. 

From equations (175), it is apparent that the quantities
$\sum(Y) \cdots \sum(X^3Y)$, and $\sum(X) \cdots \sum(X^6)$ will be required. The
arithmetic is most easily done on a machine by the continuous
process, that is, multiplying out the sub-products and adding
them cumulatively on the calculator without separate listing.
The complete work is shown, however, in the accompanying
table. It will be noted that X has been measured from the
central age, 12, which makes the sums of the odd powers of X
equal to zero.


TABLE 90. SHOWING THE FORMATION OF SUMS NECESSARY FOR FITTING
A CUBIC BY THE METHOD OF UNWEIGHTED ORDINATES

X = AGE - 12 | Y      | XY      | X^2 Y   | X^3 Y    | X^2 | X^3  | X^4   | X^5     | X^6
 7           | 1.120  |  7.840  |  54.880 |  384.160 |  49 |  343 | 2,401 |  16,807 | 117,649
 6           | 1.139  |  6.834  |  41.004 |  246.024 |  36 |  216 | 1,296 |   7,776 |  46,656
 5           | 1.091  |  5.455  |  27.275 |  136.375 |  25 |  125 |   625 |   3,125 |  15,625
 4           | 1.055  |  4.220  |  16.880 |   67.520 |  16 |   64 |   256 |   1,024 |   4,096
 3           | 1.018  |  3.054  |   9.162 |   27.486 |   9 |   27 |    81 |     243 |     729
 2           | 0.971  |  1.942  |   3.884 |    7.768 |   4 |    8 |    16 |      32 |      64
 1           | 0.920  |   .920  |    .920 |     .920 |   1 |    1 |     1 |       1 |       1
 0           | 0.827  |  0      |   0     |    0     |   0 |    0 |     0 |       0 |       0
-1           | 0.757  | - .757  |    .757 | -   .757 |   1 |   -1 |     1 |      -1 |       1
-2           | 0.674  | -1.348  |   2.696 | -  5.392 |   4 |   -8 |    16 |     -32 |      64
-3           | 0.570  | -1.710  |   5.130 | - 15.390 |   9 |  -27 |    81 |    -243 |     729
-4           | 0.499  | -1.996  |   7.984 | - 31.936 |  16 |  -64 |   256 |  -1,024 |   4,096
-5           | 0.441  | -2.205  |  11.025 | - 55.125 |  25 | -125 |   625 |  -3,125 |  15,625
-6           | 0.360  | -2.160  |  12.960 | - 77.760 |  36 | -216 | 1,296 |  -7,776 |  46,656
-7           | 0.261  | -1.827  |  12.789 | - 89.523 |  49 | -343 | 2,401 | -16,807 | 117,649
Totals: 0    | 11.703 | 18.262  | 207.346 |  594.370 | 280 |    0 | 9,352 |       0 | 369,640

Equations (175) may now be written

(a) 11.703 = 15 $C_0$ + 280 $C_2$.
(b) 18.262 = 280 $C_1$ + 9352 $C_3$.
(c) 207.346 = 280 $C_0$ + 9352 $C_2$.
(d) 594.370 = 9352 $C_1$ + 369,640 $C_3$.

These may be solved by elimination, as illustrated in the preceding
section, giving

$C_0 = +.8305$, $C_1 = +.0743$, $C_2 = -.002693$, and $C_3 = -.000272$.

The required cubic is therefore

$$Y = .8305 + .0743X - .002693X^2 - .000272X^3. \tag{185}$$

In order to compare results with those obtained by the following
method, the origin will be shifted to age 13. Taken from this
point, the equation becomes

$$Y = .8305 + .0743(X_1 + 1) - .002693(X_1 + 1)^2 - .000272(X_1 + 1)^3,$$

or

$$Y = .902 + .0681X_1 - .00351X_1^2 - .000272X_1^3. \tag{186}$$
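
Since the odd power sums vanish, the four normal equations fall into two independent pairs, and the shift of origin is a matter of expanding powers of $(X_1 + 1)$. The Python sketch below retraces both steps from the totals of Table 90; the helper `solve_2x2` (Cramer's rule) is a name chosen for this example rather than anything prescribed by the text.

```python
def solve_2x2(a11, a12, b1, a21, a22, b2):
    """Solve a11*x + a12*y = b1, a21*x + a22*y = b2 by Cramer's rule."""
    det = a11 * a22 - a12 * a21
    return (b1 * a22 - a12 * b2) / det, (a11 * b2 - a21 * b1) / det


# Totals taken from Table 90 (X measured from age 12).
n, s2, s4, s6 = 15, 280, 9352, 369640
sy, sxy, sx2y, sx3y = 11.703, 18.262, 207.346, 594.370

c0, c2 = solve_2x2(n, s2, sy, s2, s4, sx2y)      # equations (a) and (c)
c1, c3 = solve_2x2(s2, s4, sxy, s4, s6, sx3y)    # equations (b) and (d)

# Shift the origin to age 13 by substituting X = X1 + 1 and collecting powers.
d0 = c0 + c1 + c2 + c3
d1 = c1 + 2 * c2 + 3 * c3
d2 = c2 + 3 * c3
d3 = c3

print([round(v, 6) for v in (c0, c1, c2, c3)])
print([round(v, 6) for v in (d0, d1, d2, d3)])
```

The first printed list corresponds to the constants of equation (185) and the second to those of equation (186).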

Before plotting this result together with the observed points, 
the equation of the cubic by the method of weighted ordinates 
will next be obtained. The arithmetic is much lengthier be- 
cause none of the terms vanish as above. Table 91 shows the 
full calculation for the sums entering into equations (176), each 
of the totals being divided by 520 to give more convenient 
numbers. 


Forming equations (176), we find

(a) .8534 = $C_0$ - .2519 $C_1$ + 10.663 $C_2$ - 25.106 $C_3$.
(b) .5047 = - .2519 $C_0$ + 10.663 $C_1$ - 25.106 $C_2$ + 288.51 $C_3$.
(c) 7.2352 = 10.663 $C_0$ - 25.106 $C_1$ + 288.51 $C_2$ - 1295.2 $C_3$.
(d) -1.6374 = - 25.106 $C_0$ + 288.51 $C_1$ - 1295.2 $C_2$ + 11,377 $C_3$.

The elimination is next given in detail for illustration.
Multiplying (a) by .2519 and adding to (b) gives

(e) .7197 = 10.600 $C_1$ - 22.42 $C_2$ + 282.19 $C_3$.

Multiplying (a) by 10.663 and subtracting (c), we obtain

(f) 1.8646 = 22.42 $C_1$ - 174.81 $C_2$ + 1027.5 $C_3$,

and multiplying (a) by 25.106 and adding to (d), we find

(g) 19.788 = 282.19 $C_1$ - 1027.5 $C_2$ + 10,747 $C_3$.

This gives three equations in three unknowns.
Terms in $C_1$ are next eliminated as follows:

(h) .628 = - 430.6 $C_2$ + 3235 $C_3$ [(g) - 26.622 x (e)].
(i) .1619 = - 60.23 $C_2$ + 203.61 $C_3$ [.4728 x (f) - (e)].

Multiplying (h) by .13987 and subtracting from (i) gives,
finally, (j) .0741 = - 248.9 $C_3$. Therefore $C_3 = -.000298$.
Substituting this value in (h), $C_2 = -.00370$.

From (g) and (a), we also find $C_1 = .0680$ and $C_0 = .9025$.
The required cubic is therefore

$$Y = .9025 + .0680X - .00370X^2 - .000298X^3. \tag{187}$$
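
The weighted normal equations (176) can also be formed directly from Table 89 and solved by machine. The following Python sketch does so with X measured from age 13 and each sum divided by the total frequency, as in the text. It assumes that the central ages of Table 89 run from 19 down to 5, which is what the deviations in Table 90 imply; `solve_linear` and `wsum` are illustrative helper names.

```python
def solve_linear(matrix, rhs):
    """Solve a small linear system by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [list(map(float, row)) + [float(r)] for row, r in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, n):
            factor = aug[row][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[row][k] -= factor * aug[col][k]
    out = [0.0] * n
    for row in range(n - 1, -1, -1):
        tail = sum(aug[row][k] * out[k] for k in range(row + 1, n))
        out[row] = (aug[row][n] - tail) / aug[row][row]
    return out


ages = list(range(19, 4, -1))       # assumed central ages 19, 18, ..., 5
freqs = [3, 13, 39, 54, 84, 63, 48, 44, 38, 36, 30, 24, 21, 15, 8]
ratios = [1.120, 1.139, 1.091, 1.055, 1.018, 0.971, 0.920, 0.827,
          0.757, 0.674, 0.570, 0.499, 0.441, 0.360, 0.261]
xs = [age - 13 for age in ages]     # X measured from age 13
total = sum(freqs)


def wsum(power, with_y=False):
    """Frequency-weighted sum of X^power (times Y if requested), divided by N."""
    return sum(f * x ** power * (y if with_y else 1)
               for f, x, y in zip(freqs, xs, ratios)) / total


matrix = [[wsum(i + j) for j in range(4)] for i in range(4)]
rhs = [wsum(i, with_y=True) for i in range(4)]
c0, c1, c2, c3 = solve_linear(matrix, rhs)
print(round(c0, 4), round(c1, 4), round(c2, 5), round(c3, 6))
```

The entries of `matrix` and `rhs` correspond to the coefficients of equations (a) to (d) above, so they offer a convenient check on the hand-computed sums.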


It will be noted that the coefficients in equations (186) and
(187) are in close agreement except for the last two, where no
great effect will be produced except for high values of $X$. Comparison
of the two cubics is shown in Table 92, where values of
$Y$ have been tabulated with $X$ taken from the origin at age 13.
The plot of these results in Fig. 75 shows that the only notice- 
able difference in fit occurs for the high values of ossification 
ratio, but the number of cases in this range is so small that 
no very accurate smoothing is to be expected. Experience 
generally shows that the method of unweighted ordinates gives 
approximately as good results as the method of weighted ordi- 
nates, except when the weighting is very uneven. 


TABLE 92. VALUES FOR PLOTTING CUBICS (186) AND (187)

AGE | X  | ORDINATE Y FOR (186) | ORDINATE Y FOR (187)
20  |  7 | 1.113                | 1.095
19  |  6 | 1.125                | 1.113
18  |  5 | 1.121                | 1.113
17  |  4 | 1.101                | 1.096
16  |  3 | 1.067                | 1.065
15  |  2 | 1.022                | 1.021
14  |  1 | 0.966                | 0.966
13  |  0 | 0.902                | 0.902
12  | -1 | 0.831                | 0.831
11  | -2 | 0.754                | 0.754
10  | -3 | 0.673                | 0.673
 9  | -4 | 0.591                | 0.590
 8  | -5 | 0.508                | 0.507
 7  | -6 | 0.426                | 0.426
 6  | -7 | 0.347                | 0.347
 5  | -8 | 0.272                | 0.274
 4  | -9 | 0.203                | 0.208

Fig. 75. Plot of the cubics (186) and (187) (ossification ratio plotted against age)


8. THE METHOD OF MOMENTS APPLIED TO FREQUENCY DATA 


In fitting frequency distributions with mathematical curves, 
one of the best and most widely used procedures is the method 
of moments, developed for this purpose by Professor Pearson. 
The graduation, or fit, is obtained by equating the moments of 
the data to the moments of the curve to be fitted. 

If a frequency distribution be given with frequencies $f_1, f_2,
f_3, \cdots, f_s$ occurring at class values $X_1, X_2, X_3, \cdots, X_s$, then the
sum $f_1X_1 + f_2X_2 + f_3X_3 + \cdots + f_sX_s$ is called the first moment
with reference to the origin from which X is measured. Similarly,
$f_1X_1^2 + f_2X_2^2 + f_3X_3^2 + \cdots + f_sX_s^2$ is called the second moment,
and $f_1X_1^3 + f_2X_2^3 + f_3X_3^3 + \cdots + f_sX_s^3$ is known as the third
moment, etc. These quantities may be more briefly written
as $\sum fX$, $\sum fX^2$, $\sum fX^3, \cdots$, so that the

$$p\text{th moment about the origin} = \sum fX^p. \tag{188}$$

When each of the above moments has been divided by N,
the result,

$$\bar{\nu}_p = \frac{\sum fX^p}{N}, \tag{189}$$

{Moment coefficient about the origin}

has been termed by Professor Pearson a moment coefficient
about the origin.

The moment coefficients about the mean are given by the
formula

$$\nu_p = \frac{\sum fx^p}{N} = \frac{\sum f(X - M)^p}{N}. \tag{190}$$

{Formula for moment coefficients about the mean}

The reader will note that the moment coefficient about the
origin is denoted by $\bar{\nu}_p$, while the moment coefficient about
the mean is given by $\nu_p$. Substituting various values for p in
(190), and observing that $\sum f = N$, we may write

$$\nu_0 = \frac{\sum fx^0}{N} = 1, \tag{191a}$$

$$\nu_1 = \frac{\sum fx}{N} = 0, \tag{191b}$$

$$\nu_2 = \frac{\sum fx^2}{N} = \sigma^2, \text{ etc.} \tag{191c}$$

{Moment coefficients about the mean}

Certain relationships between the moments about the origin
and the mean may be obtained by expanding (190). Thus,

$$\nu_p = \frac{1}{N}\sum f\left[X^p - pX^{p-1}M + \frac{p(p-1)}{2}X^{p-2}M^2 - \frac{p(p-1)(p-2)}{6}X^{p-3}M^3 + \cdots\right],$$

or, since $M = \bar{\nu}_1$,

$$\nu_p = \bar{\nu}_p - p\bar{\nu}_{p-1}\bar{\nu}_1 + \frac{p(p-1)}{2}\bar{\nu}_{p-2}\bar{\nu}_1^2 - \frac{p(p-1)(p-2)}{6}\bar{\nu}_{p-3}\bar{\nu}_1^3 + \cdots. \tag{192}$$

Since $\bar{\nu}_0 = \nu_0 = 1$, we find, upon setting p = 1, 2, 3, and 4 in
this last equation, that

$$\nu_1 = \bar{\nu}_1 - \bar{\nu}_1 = 0, \tag{193a}$$

$$\nu_2 = \bar{\nu}_2 - \bar{\nu}_1^2, \tag{193b}$$

$$\nu_3 = \bar{\nu}_3 - 3\bar{\nu}_1\bar{\nu}_2 + 2\bar{\nu}_1^3, \tag{193c}$$

and that

$$\nu_4 = \bar{\nu}_4 - 4\bar{\nu}_1\bar{\nu}_3 + 6\bar{\nu}_1^2\bar{\nu}_2 - 3\bar{\nu}_1^4. \tag{193d}$$

{Moment coefficients about the mean in terms of those about the origin}

By taking the moments about the origin, we may also write

$$\bar{\nu}_p = \frac{1}{N}\sum f(x + M)^p = \frac{1}{N}\sum f\left[x^p + px^{p-1}M + \frac{p(p-1)}{2}x^{p-2}M^2 + \cdots\right],$$

or

$$\bar{\nu}_p = \nu_p + p\nu_{p-1}\bar{\nu}_1 + \frac{p(p-1)}{2}\nu_{p-2}\bar{\nu}_1^2 + \frac{p(p-1)(p-2)}{6}\nu_{p-3}\bar{\nu}_1^3 + \cdots.$$

Transposing we then have

$$\nu_p = \bar{\nu}_p - p\nu_{p-1}\bar{\nu}_1 - \frac{p(p-1)}{2}\nu_{p-2}\bar{\nu}_1^2 - \frac{p(p-1)(p-2)}{6}\nu_{p-3}\bar{\nu}_1^3 - \cdots. \tag{194}$$

Substituting values of p from 0 to 4 gives the following set of
equations, which may be used as a check on equations (193):

$$\nu_0 = 1, \tag{195a}$$

$$\nu_1 = 0, \tag{195b}$$

$$\nu_2 = \bar{\nu}_2 - \bar{\nu}_1^2, \tag{195c}$$

$$\nu_3 = \bar{\nu}_3 - 3\bar{\nu}_1\nu_2 - \bar{\nu}_1^3, \tag{195d}$$

$$\nu_4 = \bar{\nu}_4 - 4\bar{\nu}_1\nu_3 - 6\bar{\nu}_1^2\nu_2 - \bar{\nu}_1^4. \tag{195e}$$

{Moment coefficients about the mean in terms
of moments about the origin and mean}

The fifth, sixth, and higher moments might be formed in a 
similar way, but Professor Pearson * has shown that, except for 
very large samples, their probable errors are too high for the 
results to be of any value in curve-fitting. 

It should be noted that equations (193) and (195) hold when 
X is measured from any origin, since x = X — M= X’— M’, 
where X’= X —A, A being the arbitrary origin. The moment 
coefficients about the mean may therefore be obtained by choos- 
ing an arbitrary point and making subsequent adjustment as 
in the case of the standard deviation. 
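
The agreement between the direct definition (190) and the two sets of conversion formulas (193) and (195) can be verified numerically. The Python sketch below does this for a small made-up frequency distribution; the data and the function names are purely illustrative and are not taken from the text.

```python
def raw_moment_coeffs(freqs, values, max_p=4):
    """Moment coefficients about the origin, equation (189), for p = 0..max_p."""
    n = sum(freqs)
    return [sum(f * v ** p for f, v in zip(freqs, values)) / n
            for p in range(max_p + 1)]


def central_from_raw(raw):
    """Moment coefficients about the mean by equations (193a) to (193d)."""
    m1, m2, m3, m4 = raw[1:5]
    return [1.0,
            0.0,
            m2 - m1 ** 2,
            m3 - 3 * m1 * m2 + 2 * m1 ** 3,
            m4 - 4 * m1 * m3 + 6 * m1 ** 2 * m2 - 3 * m1 ** 4]


freqs = [1, 3, 4, 2]        # hypothetical frequencies
values = [0, 1, 2, 3]       # hypothetical class values
n = sum(freqs)

raw = raw_moment_coeffs(freqs, values)
mean = raw[1]

# Direct computation from deviations about the mean, equation (190).
direct = [sum(f * (v - mean) ** p for f, v in zip(freqs, values)) / n
          for p in range(5)]

nu = central_from_raw(raw)

# Check equations (195d) and (195e), which use lower central moments.
check_3 = raw[3] - 3 * raw[1] * nu[2] - raw[1] ** 3
check_4 = raw[4] - 4 * raw[1] * nu[3] - 6 * raw[1] ** 2 * nu[2] - raw[1] ** 4

print(direct[2:], nu[2:], check_3, check_4)
```

All three routes should give the same second, third, and fourth moment coefficients about the mean, apart from rounding in the last decimal places.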


* Karl Pearson, "Skew Correlation and Non-Linear Regression," Draper's Research
Memoirs II. Cambridge University Press, 1905.

It is now necessary to distinguish two types of series which
may arise:

a. The data may consist of a system of isolated ordinates as 
in the case of the point binomial. This type, however, will 
not be considered in the present treatment. 

b. The data may consist of a system of areas as in the fre- 
quency distribution of a measured variable. Here the moments 
are calculated by assuming that the areas are concentrated at 
the class values and corrections for equations (191) to (195) are 
therefore necessary. These adjustments, which are known as 
Sheppard’s* corrections, will next be given and the complete 
arithmetic shown for a distribution resembling the normal 
curve. 

Denoting the moment coefficients adjusted for grouping by
$\mu_1$, $\mu_2$, $\mu_3$, and $\mu_4$, Sheppard's correction may be written

$$\mu_1 = \nu_1, \tag{196a}$$

$$\mu_2 = \nu_2 - \tfrac{1}{12} = \nu_2 - .083333,\dagger \tag{196b}$$

$$\mu_3 = \nu_3, \tag{196c}$$

$$\mu_4 = \nu_4 - \tfrac{1}{2}\nu_2 + \tfrac{7}{240} = \nu_4 - .5\nu_2 + .02916667. \tag{196d}$$

{Moment coefficients about the mean adjusted for grouping
(Sheppard's corrections)}

The proof of these equations is based on the assumption that
the derivatives of the frequency function vanish at the limits of
the curve. The corrections are to be used therefore when the
distribution has "high contact" at the extremes of the scale,
that is, tapers off gradually at both ends.

* W. F. Sheppard, Proceedings of the London Mathematical Society, Vol. XXIX,
pp. 353-380.
† Note that $\sigma = \sqrt{\mu_2}\,h = \sqrt{\nu_2 - \tfrac{1}{12}}\,h$.

Professor Karl Pearson has developed a number of curves for
the purpose of describing biometric data. These curves, which
vary from extremely skewed to symmetrical types, are identified
by certain criteria worked out from the distributions to which
the curves are to be fitted. Some of the constants used by
Professor Pearson may be set down as follows:

$$\beta_1 = \frac{\mu_3^2}{\mu_2^3}, \quad \beta_2 = \frac{\mu_4}{\mu_2^2}, \quad \kappa_1 = 2\beta_2 - 3\beta_1 - 6, \quad \kappa_2 = \frac{\beta_1(\beta_2 + 3)^2}{4(4\beta_2 - 3\beta_1)(2\beta_2 - 3\beta_1 - 6)}. \tag{197}$$

{Pearson's constants for curve-fitting}

It will be noted that $\beta_1$ and $\beta_2$ are independent of the units of
measure of the distributed variables.

The steps in curve-fitting are then briefly as follows:

1. Work out the first four adjusted moment coefficients,
$\mu_1$, $\mu_2$, $\mu_3$, and $\mu_4$.

2. Form $\beta_1$, $\beta_2$, $\kappa_1$, and $\kappa_2$, in order to determine which type
of curve to employ.

3. Find the constants of the curve selected from the moments
and the $\beta$'s (formulas for the maximum ordinate and
other parameters are given in Elderton* for each type of curve).

4. Plot the curve with a histogram of the data and note
the general goodness of fit.

5. Test the goodness of fit by the $\chi^2$ method, finding the areas
under the curve by arithmetical or mechanical integration.

In the following section these steps will be illustrated by the 
normal probability curve. 


9. FITTING A NORMAL CURVE BY THE METHOD OF MOMENTS 


The data selected for graduation consist of the heights of men 
in the British Isles (see Table 41, p. 206). These have been 
chosen because they furnish a fairly good example of normally 
distributed data and illustrate the simplest of Pearson’s types 
of frequency curves. 

* See Elderton's "Frequency Curves and Correlation," Jones's "First Course in
Statistics," and Pearson's Tables, Introduction, for detailed discussion of these types
of curves.

The criteria for the normal curve $y = y_0e^{-x^2/2\sigma^2}$, which should
be satisfied if this curve is appropriate, are

$$\beta_1 = 0 \tag{198}$$

and

$$\beta_2 = 3, \tag{199}$$

while the constants are determined by

$$\sigma^2 = \mu_2, \tag{200}$$

$$y_0 = \frac{N}{\sigma\sqrt{2\pi}}, \tag{201}$$

and

$$M = 0 = \text{origin}. \tag{202}$$

{Criteria and constants for a normal curve}


It is now necessary to work out these values from the data, 
and compare with those given by equations (198) and (199). 
The constants for the curve are furnished by equations (200), 
(201), and (202). 

In calculating the unadjusted moments the arithmetic may
be conveniently arranged as illustrated by Table 93. Using
equation (189), we find from the values at the bottom
of the table that $\bar{\nu}_1 = .020850$, $\bar{\nu}_2 = 6.617239$, $\bar{\nu}_3 = 0.206057$,
and $\bar{\nu}_4 = 137.689109$. Substituting these values in equations
(193) or (195), the unadjusted moment coefficients about the
mean become

$\nu_1 = 0$, $\nu_2 = 6.616804$, $\nu_3 = -.207833$, and $\nu_4 = 137.689183$.

The adjusted moment coefficients may now be found from
equations (196), giving

$\mu_1 = 0$, $\mu_2 = 6.533471$, $\mu_3 = -.207833$, and $\mu_4 = 134.4099$.

By substituting these last values in equations (197), where
the general expressions for $\beta_1$ and $\beta_2$ are given, we find

$\beta_1 = .000155$
and $\beta_2 = 3.14879$.

The values for $\kappa_1$ and $\kappa_2$ are not required in fitting the normal
probability curve. 
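The arithmetic of this section can be checked mechanically. The following is a minimal sketch, not part of the original text: it takes the frequencies of Table 93, forms the moment coefficients about the assumed origin as in equation (189), transfers them to the mean by equations (193), applies Sheppard's corrections (196) with the class interval as the unit, and forms the $\beta$'s of equation (197). The names `FREQS`, `DEVS`, and `moment_summary` are illustrative only. Because the sketch carries full precision, the last printed digit may differ slightly from the hand computation.

```python
# Frequencies f from Table 93, listed from d = +10 down to d = -10, where d is
# the deviation from the assumed mean measured in class intervals.
FREQS = [2, 5, 16, 32, 79, 202, 392, 646, 1063, 1230, 1329,
         1223, 990, 669, 394, 169, 83, 41, 14, 4, 2]
DEVS = list(range(10, -11, -1))


def moment_summary(freqs, devs):
    """Return N, the adjusted moments, and Pearson's beta constants."""
    n = sum(freqs)
    # Equation (189): moment coefficients about the assumed origin.
    o1, o2, o3, o4 = (
        sum(f * d ** p for f, d in zip(freqs, devs)) / n for p in (1, 2, 3, 4)
    )
    # Equations (193): transfer from the assumed origin to the mean.
    v2 = o2 - o1 ** 2
    v3 = o3 - 3 * o1 * o2 + 2 * o1 ** 3
    v4 = o4 - 4 * o1 * o3 + 6 * o1 ** 2 * o2 - 3 * o1 ** 4
    # Equations (196): Sheppard's corrections (class interval = 1).
    mu2 = v2 - 1 / 12
    mu3 = v3
    mu4 = v4 - v2 / 2 + 7 / 240
    # Equations (197): Pearson's constants beta_1 and beta_2.
    beta1 = mu3 ** 2 / mu2 ** 3
    beta2 = mu4 / mu2 ** 2
    return {"N": n, "mu2": mu2, "mu3": mu3, "mu4": mu4,
            "beta1": beta1, "beta2": beta2}


summary = moment_summary(FREQS, DEVS)
for name, value in summary.items():
    print(f"{name:>5} = {value:.6f}")
```

The same function may be applied to any frequency distribution given in class-interval deviations, for example the data of Exercise 5.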


TABLE 93. SHOWING CALCULATION OF THE FIRST FOUR UNADJUSTED
MOMENTS OF A FREQUENCY DISTRIBUTION

CENTRAL
HEIGHT         f       d        fd     $fd^2$      $fd^3$       $fd^4$
77 7/16        2      10        20       200       2,000       20,000
76 7/16        5       9        45       405       3,645       32,805
75 7/16       16       8       128     1,024       8,192       65,536
74 7/16       32       7       224     1,568      10,976       76,832
73 7/16       79       6       474     2,844      17,064      102,384
72 7/16      202       5     1,010     5,050      25,250      126,250
71 7/16      392       4     1,568     6,272      25,088      100,352
70 7/16      646       3     1,938     5,814      17,442       52,326
69 7/16    1,063       2     2,126     4,252       8,504       17,008
68 7/16    1,230       1     1,230     1,230       1,230        1,230
67 7/16    1,329       0         0         0           0            0
66 7/16    1,223      -1    -1,223     1,223      -1,223        1,223
65 7/16      990      -2    -1,980     3,960      -7,920       15,840
64 7/16      669      -3    -2,007     6,021     -18,063       54,189
63 7/16      394      -4    -1,576     6,304     -25,216      100,864
62 7/16      169      -5      -845     4,225     -21,125      105,625
61 7/16       83      -6      -498     2,988     -17,928      107,568
60 7/16       41      -7      -287     2,009     -14,063       98,441
59 7/16       14      -8      -112       896      -7,168       57,344
58 7/16        4      -9       -36       324      -2,916       26,244
57 7/16        2     -10       -20       200      -2,000       20,000
Totals     8,585              +179    56,809      +1,769    1,182,061
Unadjusted moments      $= N\nu'_1$  $= N\nu'_2$  $= N\nu'_3$  $= N\nu'_4$


The probable errors of $\sqrt{\beta_1}$ and $\beta_2$ for samples from a normal
population are given approximately by

P.E. of $\sqrt{\beta_1} = .6745\sqrt{\dfrac{6}{N}}$ (203)

and

P.E. of $\beta_2 = .6745\sqrt{\dfrac{24}{N}}$. (204)

We may therefore write $\sqrt{\beta_1} = .012 \pm .018$ and $\beta_2 = 3.149 \pm .036$,
and conclude that the normal curve is appropriate even though
a value of $\beta_2$ as high as 3.149 is rather improbable.
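As a check on the two probable errors just quoted, the short sketch below evaluates them for N = 8,585. It assumes that equations (203) and (204) are the usual normal-sample results, .6745 times $\sqrt{6/N}$ and .6745 times $\sqrt{24/N}$; the function name `probable_errors` is hypothetical. It also expresses the excess of $\beta_2$ over 3 in units of its probable error, which is what makes the observed value "rather improbable."

```python
import math


def probable_errors(n):
    """Return (P.E. of sqrt(beta_1), P.E. of beta_2) for a normal sample of size n."""
    pe_sqrt_beta1 = 0.6745 * math.sqrt(6 / n)   # assumed form of equation (203)
    pe_beta2 = 0.6745 * math.sqrt(24 / n)       # assumed form of equation (204)
    return pe_sqrt_beta1, pe_beta2


pe_sqrt_beta1, pe_beta2 = probable_errors(8585)
# Excess of the observed beta_2 over the normal value 3, in probable errors.
excess_in_pe = (3.149 - 3.0) / pe_beta2

print(f"P.E. of sqrt(beta_1) = {pe_sqrt_beta1:.3f}")
print(f"P.E. of beta_2       = {pe_beta2:.3f}")
print(f"(beta_2 - 3) / P.E.  = {excess_in_pe:.1f}")
```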


When the goodness of fit is tested by $\chi^2$ as in section 7 of
Chapter XIII, it is found that the fit is satisfactory. This is
left as an exercise for the student.
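One way of setting up this exercise is sketched below. It is an illustration under stated assumptions rather than the procedure of Chapter XIII: the class areas are found from the error function instead of printed tables, the class limits are taken half an interval on either side of each central value, and tail classes are pooled until each theoretical frequency is at least 5 (the threshold and the helper names `expected_frequencies` and `pool_tails` are choices made here, not taken from the text). The constants follow equations (200) and (201), using the adjusted $\mu_2$ found above.

```python
import math

# Frequencies of Table 93, from d = +10 down to d = -10 (class-interval units).
FREQS = [2, 5, 16, 32, 79, 202, 392, 646, 1063, 1230, 1329,
         1223, 990, 669, 394, 169, 83, 41, 14, 4, 2]
DEVS = list(range(10, -11, -1))
MU2 = 6.533471  # adjusted second moment from the text


def normal_cdf(z):
    """Area of the unit normal curve to the left of z."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def expected_frequencies(devs, n, mean, sigma):
    """Theoretical frequency in each class of unit width centred on d."""
    expected = []
    for d in devs:
        upper = normal_cdf((d + 0.5 - mean) / sigma)
        lower = normal_cdf((d - 0.5 - mean) / sigma)
        expected.append(n * (upper - lower))
    return expected


def pool_tails(observed, expected, min_expected=5.0):
    """Merge end classes inward until each end class reaches min_expected."""
    obs, exp = list(observed), list(expected)
    while len(exp) > 2 and exp[0] < min_expected:
        obs[1] += obs[0]
        exp[1] += exp[0]
        del obs[0], exp[0]
    while len(exp) > 2 and exp[-1] < min_expected:
        obs[-2] += obs[-1]
        exp[-2] += exp[-1]
        del obs[-1], exp[-1]
    return obs, exp


n = sum(FREQS)
mean = 179 / n                                 # mean in class units from the assumed origin
sigma = math.sqrt(MU2)                         # equation (200)
y0 = n / (sigma * math.sqrt(2.0 * math.pi))    # equation (201)

expected = expected_frequencies(DEVS, n, mean, sigma)
obs_pooled, exp_pooled = pool_tails(FREQS, expected)
chi_square = sum((o - e) ** 2 / e for o, e in zip(obs_pooled, exp_pooled))

print(f"sigma = {sigma:.4f}, y0 = {y0:.1f}")
print(f"classes after pooling = {len(exp_pooled)}, chi-square = {chi_square:.2f}")
```

The resulting value of $\chi^2$ is then referred to the probability tables with the number of classes remaining after pooling.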




EXERCISES 


1. Fit a hyperbola by the method of averages to the data in the
accompanying table. Use the scale 1, 2, 3, ... for pages written, and
select $X_1 = 1$, $Y_1 = 30$ for rectifying point.


PAGES WRITTEN     WORDS TYPED IN FOUR MINUTES
370               192
350               188
330               184
310               172
290               195
270               178
250               180
230               164
210               161
190               160
170               151
150               142
130               137
110               (illegible)
90                106
70                100
50                81
30                57
10                30

($Y = \dfrac{X - 1}{.027 + .0044X} + 30$. Ans.)


2. Fit the data of Exercise 1 with a logarithmic growth curve,
using the method of least squares. Compare the fit with that
obtained for the hyperbola.

($Y = 22.56 + .526X + 127.1 \log X$. Ans.)


3. The data on page 346 are the ossification ratios of 540 girls of 
the Laboratory Schools of The University of Chicago. Fit a cubic 
to the means by the method of least squares. (Use unweighted ordi- 
nates and take the origin at age 12.) 


4. Calculate and plot the means of the columns from the table 
on page 189. Fit a cubic to these points by the method of least 
squares, using unweighted ordinates. Compare the equation with 
the following, based on more data: * 


$B = 23.14 + 1.2545a - .0089a^2 + .000025a^3$.


* See Memoirs of the National Academy of Sciences, Vol. XV, p. 576. 
CENTRAL AGE      f      MEAN
19               5      1.160
18              14      1.102
17              53      1.098
16              63      1.108
15              69      1.089
14              63      1.061
13              40      1.033
12              44       .988
11              38       .898
10              38       .834
9               39       .730
8               26       .662
7               16       .523
6               23       .442
5                9       .358
Total          540


($y = .961 + .0576x - .00475x^2 - .0000230x^3$. Ans.)


5. Data: cephalic index of 1982 boys aged 13 (from Professor
Pearson's laboratory).

INDEX      f          INDEX      f
91.5       1          78.5       293.5
90.5       1          77.5       236.5
89.5       4          76.5       181.5
88.5       4          75.5       156.5
87.5       7          74.5       78
86.5       23         73.5       49
85.5       31         72.5       23
84.5       58         71.5       26
83.5       93         70.5       8
82.5       130        69.5       8
81.5       156        68.5       2
80.5       181.5      67.5       3
79.5       227.5
Total                            1982

Find $\mu_2$, $\mu_3$, $\mu_4$, $\beta_1$, and $\beta_2$, and fit with a normal curve. Work out the
chi-square test for goodness of fit.

($\mu_2 = 10.980$; $\mu_3 = 2.326$; $\mu_4 = 409.112$; $\beta_1 = .0041$; $\beta_2 = 3.393$;
$y_0 = 238.62$; $P = .0001$, throwing together the five highest and also the
four lowest groups. Ans.)
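The printed answer to Exercise 1 can be compared with the data directly. The sketch below reads that answer as $Y = 30 + (X - 1)/(.027 + .0044X)$, which is an assumption about a partly illegible line, and uses the scale 1, 2, 3, ... for pages written as the exercise directs. The entry for 110 pages is not legible in the table and is left out; the names `hyperbola`, `pages`, and `words` are illustrative.

```python
# Pages written (10, 30, ..., 370) and words typed in four minutes, ascending.
pages = list(range(10, 371, 20))
words = [30, 57, 81, 100, 106, None, 137, 142, 151, 160,
         161, 164, 180, 178, 195, 172, 184, 188, 192]  # None: illegible entry


def hyperbola(x):
    """Fitted hyperbola through the rectifying point X1 = 1, Y1 = 30 (assumed reading)."""
    return 30 + (x - 1) / (0.027 + 0.0044 * x)


residuals = []
for x, (p, w) in enumerate(zip(pages, words), start=1):
    if w is None:
        continue  # skip the entry that cannot be read
    fitted = hyperbola(x)
    residuals.append(w - fitted)
    print(f"pages {p:3d}  X = {x:2d}  observed {w:3d}  fitted {fitted:6.1f}")

mean_abs_residual = sum(abs(r) for r in residuals) / len(residuals)
print(f"mean absolute residual = {mean_abs_residual:.1f} words")
```

The same loop, with `hyperbola` replaced by the logarithmic growth curve of Exercise 2, gives the comparison of fits asked for there.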
LIST OF IMPORTANT FORMULAS FOR REFERENCE*


$M = \dfrac{\Sigma X}{N}$. {Mean for series} (5)
$M = A + \left(\dfrac{\Sigma fd}{N}\right)h$. {Mean for distribution} (6)
$Md = l.l. + \left(\dfrac{\frac{N}{2} - f_{up}}{f_{md}}\right)h$. {Median for distribution, counting up} (8a)
$Md = u.l. - \left(\dfrac{\frac{N}{2} - f_{do}}{f_{md}}\right)h$. {Median for distribution, counting down} (8b)
$G.M. = \sqrt[N]{X_1 X_2 X_3 \cdots X_N}$. {Geometric mean} (9)
$\log (G.M.) = \dfrac{1}{N}\Sigma \log (X)$. {Logarithmic form of geometric mean} (10)
$\dfrac{1}{H} = \dfrac{1}{N}\Sigma\left(\dfrac{1}{X}\right)$. {Harmonic mean} (11)
$M.D. = \dfrac{\Sigma |x|}{N}$. {Mean deviation} (12)
$M.D. = \dfrac{(\Sigma |fd|)h + (A_m - M)(N_a - N_b)}{N}$. {Mean deviation for frequency distribution} (14)
$S.D. = \sqrt{\dfrac{\Sigma x^2}{N}}$. {Standard deviation, original form} (15)
$S.D. = \sqrt{\dfrac{\Sigma X'^2}{N} - (M')^2}$. {Standard deviation for reduced series} (16)


* For notation see list of important symbols in Appendix B.
$S.D. = \left(\sqrt{\dfrac{\Sigma fd^2}{N} - \left(\dfrac{\Sigma fd}{N}\right)^2}\right)h$. {Standard deviation for distribution} (17)
$Q = \dfrac{Q_3 - Q_1}{2}$. {Quartile deviation} (19)
$Q_3 = u.l. - \left(\dfrac{\frac{N}{4} - f_{do}}{f_3}\right)h$, (20a)
$Q_1 = l.l. + \left(\dfrac{\frac{N}{4} - f_{up}}{f_1}\right)h$. (20b)
{Quartiles for distribution}
$V = \dfrac{100\sigma}{M}$. {Coefficient of variation} (22)
$X_1 = M_1 + \dfrac{\sigma_1}{\sigma_2}(X_2 - M_2)$. {Transmutation formula for comparable scores, score form} (23)
$Sk = \dfrac{(Q_3 - Md) - (Md - Q_1)}{Q} = \dfrac{Q_1 + Q_3 - 2Md}{Q}$. {Measure of skewness based on quartiles} (24)
$Sk = \dfrac{M - Mo}{\sigma}$. {Pearson's measure of skewness} (25)
$Sk = \dfrac{3(M - Md)}{\sigma}$. {Approximate measure of skewness} (26)
$P_p = l.l. + \left(\dfrac{\frac{pN}{100} - f_{up}}{f_p}\right)h$. {Percentiles, counting up} (27a)
$P_p = u.l. - \left(\dfrac{\frac{(100 - p)N}{100} - f_{do}}{f_p}\right)h$. {Percentiles, counting down} (27b)
$R_X = R_l + \dfrac{R_u - R_l}{h}(X - l.l.)$. {Percentile rank, form 1} (28)
$R_X = \dfrac{100[f_X(X - l.l.) + (f_{up})h]}{Nh}$. {Percentile rank, form 2} (29)
${}_cR_X = \dfrac{50f_X}{N} + \dfrac{100(f_{up})}{N} = R_l + \dfrac{50f_X}{N}$. {Class value} (30)
$r = \dfrac{\Sigma xy}{N\sigma_x\sigma_y}$. {Product-moment correlation coefficient, original form} (31)
$r = \dfrac{\Sigma xy}{\sqrt{\Sigma x^2\,\Sigma y^2}}$. {Correlation coefficient in terms of deviations from means} (32)
$r = \dfrac{\Sigma XY - NM_xM_y}{\sqrt{(\Sigma X^2 - NM_x^2)(\Sigma Y^2 - NM_y^2)}}$. {Correlation coefficient (based on raw scores)} (33)
$r = \dfrac{\Sigma XY - T_xM_y}{\sqrt{(\Sigma X^2 - T_xM_x)(\Sigma Y^2 - T_yM_y)}}$. {Correlation coefficient equivalent to (33)} (34)
{Correlation coefficient for distribution table} (35)
$y = r\dfrac{\sigma_y}{\sigma_x}x$. {Regression line for means of columns, referred to mean of table} (36)
$S_y = \sigma_y\sqrt{1 - r^2}$. {Standard error of estimate} (37)
$x = r\dfrac{\sigma_x}{\sigma_y}y$. {Regression line for means of rows, referred to mean of table} (38)
$Y = r\dfrac{\sigma_y}{\sigma_x}X - r\dfrac{\sigma_y}{\sigma_x}M_x + M_y$, (39)
$X = r\dfrac{\sigma_x}{\sigma_y}Y - r\dfrac{\sigma_x}{\sigma_y}M_y + M_x$. (40)
{Regression lines in score form}
$Y = \dfrac{ak}{bh}X - \dfrac{ak}{bh}M_x + M_y$, (41)
$X = \dfrac{ah}{ck}Y - \dfrac{ah}{ck}M_y + M_x$. (42)
{Regression lines in score form and symbols on correlation sheet}
$b_{yx} = r\dfrac{\sigma_y}{\sigma_x} = \dfrac{ak}{bh}$, (43)
$b_{xy} = r\dfrac{\sigma_x}{\sigma_y} = \dfrac{ah}{ck}$. (44)
{Regression coefficients}
P.E. (est. Y) $= .6745\,\sigma_y\sqrt{1 - r^2}$. {Probable error of estimate in predicting Y from X} (45)
P.E. (est. X) $= .6745\,\sigma_x\sqrt{1 - r^2}$. {Probable error of estimate in predicting X from Y} (46)
$I_f = 100\left(\dfrac{\sigma - \sigma\sqrt{1 - r^2}}{\sigma}\right) = 100(1 - \sqrt{1 - r^2})$. {Improvement over chance in prediction by a single score} (47)
$r_{nn} = \dfrac{nr_{11}}{1 + (n - 1)r_{11}}$. {Spearman-Brown formula for predicting reliability of lengthened tests} (48)
$r_{cn} = \dfrac{nr_{cz}}{\sqrt{n + n(n - 1)r_{zz}}}$. {Formula for predicting validity of lengthened tests} (51)
ot 1 hn Multiple-response 
Sk (n — De = iG scoring formula } (52) 
Correlation 
Rig= pa ee after selec- (53) 
O71 2 2 D1 2 tion 
Dy tie jars | 
O1 
o2 
Nyx =\f/1 — = (55) 
Oy 


Correlation ratios, 


es original form 
Thy = \1 — 75 (56) 


Ox 
on 
AD 
Ts = C Correlation ratios as (60) 
{auotien of iro} 
OXy standard deviations 


Nx = oa (61) 


Zfx(My — Y;)* 
TS ers Correlation ratio 
(62) 


Vx = ————————_-_ for means of col- 
umns 
Nyx = wi (63)
c Eerie ratios for | 
ae correlation blank jf 

Ny = wa (64) 


X, =A, + 2 fwd h (65) 
y x re ’ jars of on} 


sf arrays In a cor- 
ie = Ay + ae R. 


relation table 
EG 


ne — rr < 4.047 (n? — r?){(1 — n?)? — (1 — 7)? ++ 13. (67) 
VN 


N {Blakeman’s test for linearity} 


(66) 


VN Vn? — 7? < 4.047. (68) 


{Blakeman’s short test for linearity} 


=F <7 Corrective formula 
Yo=¥e+(¥e— Y1). for eliminating age } (69) 


tS Pes Corrective formula adjusting 
Ys= Ys + (¥— ¥2) oe for age and ea eect (70) 


${}_nP_r = n(n - 1)(n - 2)\cdots(n - r + 1)$. (71)
{Permutation of n things r at a time}

${}_nC_r = \dfrac{n(n - 1)(n - 2)\cdots(n - r + 1)}{1\cdot 2\cdot 3\cdots r}$. (72)
{Combination of n things r at a time}

$(q + p)^n = {}_nC_0q^n + {}_nC_1q^{n-1}p + {}_nC_2q^{n-2}p^2 + {}_nC_3q^{n-3}p^3 + \cdots + {}_nC_np^n$. (77)
{Point binomial}

$M = np$. {Mean of the point binomial} (78)

$\sigma = \sqrt{npq}$. {Standard deviation of the point binomial} (79)


$y = \dfrac{1}{\sigma\sqrt{2\pi}}e^{-\frac{x^2}{2\sigma^2}}$. {Normal curve with area = 1} (80)
$y = \dfrac{N}{\sigma\sqrt{2\pi}}e^{-\frac{x^2}{2\sigma^2}}$. {Normal curve with area = N} (82)
$z = \dfrac{1}{\sqrt{2\pi}}e^{-\frac{x^2}{2}}$. {Ordinate of the normal curve, with unit area and unit standard deviation} (83)
P.E. $= .6744898\,\sigma$. {Relation between probable error and standard deviation} (84)
a 2235 Mean of a portion of a normal verte (8 5) 
oe No with unit area and standard deviation 
Mean of a portion 
ied hepa. of a normal curve, (86) 
a, fe with area = N 
N 
P.E.$_M = .6745\dfrac{\sigma}{\sqrt{N}}$. {Probable error of the mean} (88)

P.E.$_{M_1 - M_2} = \sqrt{(\text{P.E.}_{M_1})^2 + (\text{P.E.}_{M_2})^2}$. {Probable error of the difference between two uncorrelated means} (89)

P. Bagg = AO = 1.2588 PE {STO (00) 
P.Bag = one = = 1001 PEs {andar deviation } OD 
rave Ss} Fe (SE) 

ee eee 
A I sey remem fee ney NCIS 
P.E.byy = 6745 — Vist (96a) 


NS 
oy Vir 
/N 


P. Eby, = 6745 “4 


of regression co- 


Probable errors 
efficients (96 b) 
= O1.2k Probable error of higher-order }
P.E.bi9.4 = -6745 Oo-k \/N regression coefficient sh (97) 
P.E.$_{\eta^2 - r^2} = \dfrac{2(.6745)}{\sqrt{N}}\sqrt{(\eta^2 - r^2)\{(1 - \eta^2)^2 - (1 - r^2)^2 + 1\}}$.
{Probable error of $\eta^2 - r^2$}

P.E.$_{A - B} = \sqrt{(\text{P.E.}_A)^2 + (\text{P.E.}_B)^2 - 2R_{AB}(\text{P.E.}_A)(\text{P.E.}_B)}$. (99)
{Probable error of difference with correlated measures}

P.E.$_{M_1 - M_2} = \sqrt{(\text{P.E.}_{M_1})^2 + (\text{P.E.}_{M_2})^2 - 2r_{12}\,\text{P.E.}_{M_1}\,\text{P.E.}_{M_2}}$. (100)
{Probable error of difference between means where correlated}


ss f Probable error of an 
P. Ey = 6745 s( SN): observed frequency \ (101) 


fp (100 — fp) Probable error of a per- 
P.E.5, = 6745 ‘aa ama . centage frequency } (102) 


y= Pie el. Sapte: (103) 


fat fi function 


P.E.np = 6745 Vnpgq, 


P.E.y = 6745 22. 


P. E.¢, (of individual X1) = .6745 o,,V 1— 111. (111) 


{Probable error of response for X1} 


(105) 


mean and of the pro- 


Probable errors of the ) 
portion of successes i (106) 


P.E. (of individual 2; — 22) = .6745 V2 —n7— rer. (113) 


T12 Spearman’s eee 
t=: Z (115) 
N/HiPaiT for attenuation 
Cee Ru (16a) 
Se i : Kelley’s formula for 
mar adjusting reliability 
coefficients 


o? — 27(1 — R17) 
a 


ee (116b) 
eee Dh Ae rE M,)h
- No. ? (Pearson’s formulas for the) (118a) 
xOy : : 
correlation coefiicient based 
on the means of the arrays 


rape Dfydy(Xy — M;)R (118b) 
Nox0y 
Xy 
Lfydy (2) k Correlation coefficient 
r= —————_- {spied for use with (119) 
Noy data on a normal scale 


Ufey 2 (Zs — Wig ek 1) (2's — 2's = 1) 


clxy SS Se SY 
N N 

E — (Zs — 2s 4 | Be (2's — 2's 4 »?| 
x y 


Pearson’s corrective formula for broad grouping 
assuming normal distributions of the variates 


[= Xye X,\2 
Es J 2f i Correlation ratio adapted 
SN ee 
Ox N 


for use with data on a nor | (121) 


(120) 


yn == 
ce mal scale 
_ Nyx Correlation ratio corrected 
Nyx = Te { for broad categories } (122) 
N Correlation of a variable 
aed Re eckly o8 ae 2 
i= \ ari, (2s — 2s +1)”. with its class value e (123) 
Yo — ¥. ; 
nog eRe (2). {Biserial r} (124) 
Cy z 
GTA5 & fe ) 
P.E.tis. 1) = Probable error 
bias?) /N of biserial r (125) 
alg S, Rar, First computa- 
C=. |(——— = —y) {tn form for+ (128a) 
Wat SN s contingency 
eal Second compu- ; 
C= =F tation form for} (128b) 
contingency 
Cc Correction to the con-
CO ° tingency coefficient for (129) 
TxcTyc broad grouping 
ys 2 
6745 $2 Tale? Probable error 
P.E.. = —— | ——"—_ | ° of contingency (130) 
VN (13- *)3 coefficient 
6 X(v, — v,)? Spearman’s formula 
2a See based on rank dif-+ (131) 
N(V2—11) ferences 
7063(1 — p2 Probable error of 
PoE = TAA eas | Seesrman rank (133) 
/N coefficient 
Partial-correlation 
Towa. 
EE { coefficient for three } (134) 
(1 — Tis) (1 — 133) variables 
ee) es 112.34. - -(n—1) — T1n.34.--(n—1)72n.34---(n—1) (135) 


2 2 
[1 WARY Ye sic avi {1 lon S41. ok el 
{Partial-correlation coefficient of the order (n — 2)} 


Xi = bie.sa--- nXo+ b13.04---nX3+-++-+ Bings---(n—1yAnt C. (139) 


{Regression equation for estimating X; from the remaining (n — 1) variables} 


O1.34...n i i 
Dee Tiger: ee eee (140) 


"02.34. of the order (n — 2) 


01.23---n=91 


(1 aa Tia) (1 a Ti3.0) ace (1 = Ti n28 eco Gj »): (141) 
{Standard deviation of the order (n—1)} 


P.E est = 6745 01.93--- 1. (142) 


{Probable error of estimate} 


C=M,— bios -- - nMe—bi13.24--- nM3—--+-—bing3---(n—-1)Mn. (146) 
{Constant term in regression equation} 

= a \ — Tip — Tig — 193 + 2 Mrerisr23 1 V S123 | 

HE SOR) ae ee a 

ih T93 \ / 1 ie Tas 


{Standard error of second-order in terms of zero-order coefficients} 


(147) 


a 
_ — 134) — 113723 — T1area + 134(r13724 + 114723)


712,.34= ee 
(1 —1i3—Ti4— 134 +2 7137 14734)(1 aa? 133 — 134 —T34 +2 To31 24734) 
—~ S12.34 . (151 a) 
\/'S134S234 
113(1 = Tia) — 112793 — T1a%3a + Toa(T12734 + 114703) 
713.24= —SS——SSS——SS—08080—090 00 az 
(l—ri.—1ia— 134 +2 1127147 24) (1-123 — 124-134 +2 Tea7 24a) 
ra S13.24 , (151 b) 
\/'S1248234 ’ 
T4(1 ra e) — 1127 e4 — 13734 + %23(T12%34 + 113724) 
714,23= SS 
qa —Yig—Ti3— 133 +2 T19713T23)(1 = 133 = 124 > 134 +2 T 23724734) 
a Si4.23 (151) 
\/S123S234 
{Second-order correlation coefficients in terms of zero-order coefficients} 
b12.34---n 
__ 119.34---(n—1) — T1n.84--- (n—1)T2n.84--- (n—1) 91.34--- (n—1) ’ (152) 


l=77 shel 02.34---(n—1) 
{Reduction formula for regression coefficient} 
01] 712 (1 — r34) — risres — riarea + 134 (Tis¥ea + T1472) 
i234 = ee ee eS 
1 — 133 — 134 — 134+ 2 resroarse 


ot Sia. 34 (158a) 


Og Sosa 


hee E (1 — 34) — rieres — riarsa + Tea (Tersa + ave) 
—_—— EE Lee 


Ss 1 — 135 — 134 — 134+ 2 resroarsa 
_ 91 Si3,24 

Sis.24 (153b) 
~ 03 Sosa 

eee o/s (1 — 13s) — rigroa — risrsa + 193 (riarsa + nee) 
4.23 ee. a 
04 1 — ros —, Ter —— re + 2 Test 24734 
_ 01 $14.23, 
(153c) 

4 Sasa. 


{Second-order regression coefficients in terms of zero-order coefficients} 


Sime Sten s. Standard deviation of 
01.234 = 0} ee third-order in terms of | (154) 
(1 = 793) S234 zero-order coefficients 


$R_{1(23\cdots n)} = \sqrt{1 - \dfrac{\sigma^2_{1.23\cdots n}}{\sigma_1^2}}$. {Multiple-correlation coefficient} (155)

$1 - R^2_{1(23\cdots n)} = (1 - r_{12}^2)(1 - r_{13.2}^2)(1 - r_{14.23}^2)\cdots(1 - r^2_{1n.23\cdots(n-1)})$. (156)
{Computation form for R}


aan Multiple-correlation } 
Riz... n) = 11x |<. + coefficient for equal} (157) 
wv aN? + (n — 2)rxx coefficients i 
11 «+Y21 131 Tn1 
12 Tee ‘3g Tn2 
713 123 «133 Tn3 Determinant 
A= ic rors | (160) 
, coefficients 
Tin T2n Y3n Tnn 
Ato Test Re 
= eterminant for 
A=Ir12 Tee Tse three variables \ (161) 
13 T2333 
2 2 2 ( Expanded 
A =1 — rig — 13 — Tog + 2 Ter isres. | value of meee (162) 
— Aig 
1904.1 7 ————— (163) 
V Ai1A22 
— Air 
Tike ken (164) 
V AisArr 
A 
R13 ...n) 1— res (165) 
11 
A 
O193...n=01 V1—R Sie (166) 
11 
— A12 01 
bi2.34...n= 167 
12.34---n Aiaaos (167) 
— Aix 91 
(OB oooftocam = (168) 
Yo ee seer een


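$Y = a + bX + c\log X$. {Logarithmic growth function} (172)

$Y = C_0 + C_1X + C_2X^2 + C_3X^3 + \cdots + C_nX^n$. {nth-order parabola} (173)

$\Sigma Y = C_0\Sigma(1) + C_1\Sigma(X) + C_2\Sigma(X^2) + \cdots + C_n\Sigma(X^n)$, (175a)
$\Sigma XY = C_0\Sigma(X) + C_1\Sigma(X^2) + C_2\Sigma(X^3) + \cdots + C_n\Sigma(X^{n+1})$, (175b)
$\Sigma X^2Y = C_0\Sigma(X^2) + C_1\Sigma(X^3) + C_2\Sigma(X^4) + \cdots + C_n\Sigma(X^{n+2})$, (175c)
$\Sigma X^nY = C_0\Sigma(X^n) + C_1\Sigma(X^{n+1}) + C_2\Sigma(X^{n+2}) + \cdots + C_n\Sigma(X^{2n})$. (175d)
{Normal equations, unweighted ordinates}

$\Sigma f_XY = C_0\Sigma(f_X) + C_1\Sigma(f_XX) + C_2\Sigma(f_XX^2) + \cdots + C_n\Sigma(f_XX^n)$, (176a)
$\Sigma f_XXY = C_0\Sigma(f_XX) + C_1\Sigma(f_XX^2) + C_2\Sigma(f_XX^3) + \cdots + C_n\Sigma(f_XX^{n+1})$, (176b)
$\Sigma f_XX^nY = C_0\Sigma(f_XX^n) + C_1\Sigma(f_XX^{n+1}) + C_2\Sigma(f_XX^{n+2}) + \cdots + C_n\Sigma(f_XX^{2n})$. (176c)
{Normal equations, weighted ordinates}

$\Sigma(Y) = a\Sigma(1) + b\Sigma(X) + c\Sigma(\log X)$, (183a)
$\Sigma(XY) = a\Sigma(X) + b\Sigma(X^2) + c\Sigma(X\log X)$, (183b)
$\Sigma(Y\log X) = a\Sigma(\log X) + b\Sigma(X\log X) + c\Sigma(\log X)^2$. (183c)
{Normal equations for the logarithmic growth curve}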
pth moment about the origin $= \Sigma fX^p$. (188)
$\nu'_p = \dfrac{\Sigma fX^p}{N}$. {Moment coefficient about the origin} (189)
$\nu_p = \dfrac{\Sigma fx^p}{N} = \dfrac{\Sigma f(X - M)^p}{N}$. {Formula for moment coefficients about the mean} (190)
$\nu_0 = \dfrac{\Sigma fx^0}{N} = 1$, (191a)
$\nu_1 = \dfrac{\Sigma fx}{N} = 0$, (191b)
$\nu_2 = \dfrac{\Sigma fx^2}{N} = \sigma^2$, etc. (191c)
{Moment coefficients about the mean}
$\nu_1 = \nu'_1 - \nu'_1 = 0$, (193a)
$\nu_2 = \nu'_2 - \nu'^2_1$, (193b)
$\nu_3 = \nu'_3 - 3\nu'_1\nu'_2 + 2\nu'^3_1$, (193c)
$\nu_4 = \nu'_4 - 4\nu'_1\nu'_3 + 6\nu'^2_1\nu'_2 - 3\nu'^4_1$. (193d)
{Moment coefficients about the mean in terms of those about the origin}
$\nu_0 = 1$, (195a)
$\nu_1 = 0$, (195b)
$\nu_2 = \nu'_2 - \nu'^2_1$, (195c)
$\nu_3 = \nu'_3 - 3\nu'_1\nu_2 - \nu'^3_1$, (195d)
$\nu_4 = \nu'_4 - 4\nu'_1\nu_3 - 6\nu'^2_1\nu_2 - \nu'^4_1$. (195e)
{Moment coefficients about the mean in terms of moments about the origin and mean}

$\mu_1 = \nu_1$, (196a)
$\mu_2 = \nu_2 - \frac{1}{12} = \nu_2 - .083333$, (196b)
$\mu_3 = \nu_3$, (196c)
$\mu_4 = \nu_4 - \frac{1}{2}\nu_2 + \frac{7}{240} = \nu_4 - \frac{\nu_2}{2} + .02916667$. (196d)
{Moment coefficients about the mean adjusted for grouping (Sheppard's corrections)}

$\beta_1 = \dfrac{\mu_3^2}{\mu_2^3}$, $\quad \kappa_1 = 2\beta_2 - 3\beta_1 - 6$,
$\beta_2 = \dfrac{\mu_4}{\mu_2^2}$, $\quad \kappa_2 = \dfrac{\beta_1(\beta_2 + 3)^2}{4(4\beta_2 - 3\beta_1)(2\beta_2 - 3\beta_1 - 6)}$. (197)
{Pearson's constants for curve-fitting}
LIST OF IMPORTANT SYMBOLS


In the following list the symbols are given in the order in
which they first appear in the formulas of Appendix A.


1. M denotes the arithmetic mean.
2. $\Sigma$ denotes the sum of the items of the sort indicated.
3. X denotes a raw score taken as a deviation from zero.
4. N denotes the size of the sample or the number of cases used.
5. A denotes an assumed mean or arbitrary origin.
6. f denotes the frequency in a class interval.
7. d denotes a score as a deviation from an assumed mean and
is expressed in units of class intervals.
8. h denotes the width of the class interval.
9. Md denotes the median.
10. l.l. denotes the lower limit of the interval containing the
median in formula (8a).
11. u.l. denotes the upper limit of the interval containing the
median in formula (8b).
12. $f_{up}$ denotes the total frequency up to the interval containing
the median.
13. $f_{do}$ denotes the total frequency down to the interval containing
the median.
14. $f_{md}$ denotes the frequency in the interval containing the
median.
15. G.M. denotes the geometric mean.
16. $X_1X_2X_3\cdots X_N$ denotes the product of the N values of X.
17. H denotes the harmonic mean.
18. M.D. denotes the mean deviation of scores from the arithmetic
mean.
19. $|x|$ denotes the absolute value of x, where $x = X - M$.
20. $A_m$ denotes the mid-point of the interval in which M lies.
21. $N_a$ denotes the number of cases above M.
22. $N_b$ denotes the number of cases below M.
23. S.D. denotes the standard deviation.
24. $X'$ denotes a deviation of the score from an assumed mean,
that is, $X' = X - A$.

25. M’ denotes the arithmetic mean of the X’ scores. 

26. Q denotes the quartile deviation. 

27. $Q_1$ denotes the first quartile, which is the value below which
one quarter of the cases lie.

28. $Q_3$ denotes the third quartile, which is the value below which
three quarters of the cases lie.

29. u.l. denotes the upper limit of the interval containing $Q_3$ in
formula (20a).

30. l.l. denotes the lower limit of the interval containing $Q_1$ in
formula (20b).

31. $f_{do}$ denotes the total frequency down to the interval containing
$Q_3$ in formula (20a).

32. $f_{up}$ denotes the total frequency up to the interval containing
$Q_1$ in formula (20b).

33. $f_3$ denotes the frequency in the interval containing $Q_3$.

34. $f_1$ denotes the frequency in the interval containing $Q_1$.

35. V denotes the coefficient of variation.

36. $\sigma$ denotes the standard deviation.

37. Sk denotes a measure of skewness.

38. Mo denotes the mode.

39. $P_p$ denotes a percentile value.

40. p denotes the percentage of the cases smaller than $P_p$ in
formulas (27a) and (27b).

41. $f_p$ denotes the frequency in the interval where $P_p$ lies.

42. $R_X$ denotes the percentile rank of a score X in formulas (28)
and (29).

43. $R_l$ denotes the percentile rank of the lower limit of the interval
containing X.

44. $R_u$ denotes the percentile rank of the upper limit of the interval
containing X.

45. $f_X$ denotes the frequency in the interval containing X in
formula (29).

46. ${}_cR_X$ denotes the percentile rank of the middle of the interval
containing X.

47. r denotes the product-moment coefficient of correlation.

48. x and y denote deviations from the respective means for X
and Y, that is, $x = X - M_x$ and $y = Y - M_y$.

49. $\sigma_x$ and $\sigma_y$ denote the standard deviations for X and Y,
respectively.
50. $T_x$ and $T_y$ denote the total of the X scores and the Y scores,
respectively, that is, $T_x = \Sigma X$ and $T_y = \Sigma Y$.

51. $f_x$ denotes the frequency of a column of type x in formula (35).

52. $f_y$ denotes the frequency of a row of type y.

53. $f_{xy}$ denotes the frequency of a cell common to a column and
a row.

54. $d_x$ and $d_y$ denote the deviations in class intervals from the
assumed means for the two variables.

55. a, b, and c denote the three parts of formula (35) and are
defined by that equation.

56. $S_y$ denotes the standard error of estimate in predicting Y from
X by a regression equation.

57. y and x denote points on the regression lines (36) and (38),
respectively.

58. Y and X denote points on the regression lines (39) and (40),
respectively. It should be noted that $y = Y - M_y$ and $x = X - M_x$.

59. h and k denote the widths of the class intervals for X and Y,
respectively.

60. $b_{yx}$ denotes the regression coefficient for y on x.

61. $b_{xy}$ denotes the regression coefficient for x on y.

62. P.E. denotes the probable error.

63. $I_f$ denotes the improvement over chance in prediction from a
regression equation by a single score.

64. $r_{nn}$ denotes the predicted reliability coefficient of a test n
times its original length.

65. $r_{11}$ denotes the reliability coefficient or correlation between
two parallel forms of a test.

66. $r_{cn}$ denotes the correlation between a criterion and a test
n times its original length.

67. $r_{cz}$ denotes the average correlation between a criterion and
each of several tests $z_1, z_2, z_3, \cdots, z_n$ in formula (51).

68. $r_{zz}$ denotes the average intercorrelation of the tests $z_1, z_2,
z_3, \cdots, z_n$ in formula (51).

69. S denotes the score corrected for guessing by formula (52).

70. R denotes the number of right responses in formula (52).

71. W denotes the number of wrong responses in formula (52).

72. C denotes a constant in formula (52).

73. n denotes the number of choices in answering a multiple-response
test in formula (52).

74. $R_{12}$ denotes the correlation after selection in formula (53).

75. $\Sigma_1$ denotes the standard deviation after selection in formula (53).
76. $\eta_{yx}$ denotes the correlation ratio based on the means of the
columns.

77. $\eta_{xy}$ denotes the correlation ratio based on the means of the
rows.

78. $\sigma_{ay}$ denotes the standard deviation of $y - \bar{y}_x$, where $\bar{y}_x$ denotes
the mean of a column.

79. $\sigma_{\bar{y}_x}$ denotes the standard deviation of $\bar{y}_x$.

80. $\bar{Y}_x$ denotes the mean of a column. It should be noted that
$\bar{y}_x = \bar{Y}_x - M_y$.

81. e is defined by $\Sigma f_x(M_y - \bar{Y}_x)^2/k^2$.

82. d is defined by $\Sigma f_y(M_x - \bar{X}_y)^2/h^2$.

83. $\Sigma'$ denotes summations over an array, for example, over a row
or column in the correlation table.

84. $A_x$ and $A_y$ denote the assumed means for X and for Y,
respectively.

85. $Y_s$ denotes a variable corrected for age by formulas (69) and
(70).

86. $\bar{Y}_s$ denotes the mean at age $A_s$ in formulas (69) and (70).

87. $\sigma_s$ denotes the standard deviation of the array at age $A_s$ in
formula (70).

88. ${}_nP_r$ denotes the permutation of n things r at a time.

89. ${}_nC_r$ denotes the combination of n things r at a time.

90. q denotes the probability for the failure of an event.

91. p denotes the probability for the success of an event.

92. n denotes the number of independent events for formula (77).

93. $\pi$ denotes the value obtained by dividing the circumference
of a circle by its diameter.

94. e denotes the base of the Napierian system of logarithms as
used in formula (80).

95. $y_0$ denotes the ordinate at x = 0 for a normal curve.

96. z denotes the ordinate of a normal curve with unit area and
unit standard deviation.

97. ${}_1\bar{x}_2$ denotes the mean of the portion of a normal curve lying
between the ordinates $z_1$ and $z_2$.

98. ${}_1n_2$ denotes the fractional part of the area of a normal curve
lying between the ordinates $z_1$ and $z_2$.

99. $f_p$ denotes a percentage frequency.

100. $\chi^2$ denotes the chi-square function given by formula (103).

101. $f_o$ and $f_t$ denote observed and theoretical frequencies,
respectively, in formula (103).

102. $e_1$ denotes response error in formula (111).
103. $z_1$ and $z_2$ denote standard scores defined by $\dfrac{x_1}{\sigma_1}$ and $\dfrac{x_2}{\sigma_2}$,
respectively, in formula (113).

104. $r_{st}$ denotes the correlation between "true" scores s and t
which are freed from the influence of response error.

105. ${}_cr_{xy}$ denotes the correlation coefficient corrected for broad
grouping.

106. $z_s$ and $z'_s$ denote ordinates on the two scales of a normal
correlation surface as used in formula (120).

107. ${}_c\eta_{yx}$ denotes the correlation ratio corrected for broad grouping.

108. $r_{xc}$ denotes the correlation of a variable with its class value.

109. q and p denote the parts of the unit normal curve to the
left and right of the ordinate z in formula (124).

110. $r_{bis}$ denotes biserial r.

111. C denotes the contingency coefficient in formulas (128a) and
(128b).

112. ${}_cC$ denotes the contingency coefficient corrected for broad
grouping.
118. S’ is defined by a in formula (128a). 


tly 


N 
2 
114. S is defined by >| } in formula (128b). 


wy 


(of) 
N
@
N 

117. $\rho$ denotes Spearman's rank-difference correlation coefficient.

118. $v_x$ and $v_y$ denote the ranks for the X and the Y series,
respectively, in formula (131).

119. $r_{12.34\cdots n}$ denotes a partial-correlation coefficient of the order
(n - 2).

120. $r^2_{12.34\cdots n}$ denotes the square of the designated partial-correlation
coefficient.

121. $b_{12.34\cdots n}$ denotes a regression coefficient of the order (n - 2).

122. C denotes the constant term in a regression equation and is
defined by formula (146).


115. $2 is defined by = > = §—1in formula (130). 


1 


116. 8 is defined by N D> in formula (130). 
123. $\sigma_{1.23\cdots n}$ denotes the standard deviation of the order (n - 1).

124. $S_{123}$, $S_{12.34}$, etc. denote sums involving zero-order correlation
coefficients and are defined by formulas (147) and (151).

125. $R_{1(23\cdots n)}$ denotes the multiple-correlation coefficient.

126. $\Delta$ denotes a determinant.

127. Aj denotes a minor of a determinant obtained by deleting 
the coefficients in a row and column common to r;;. 

128. Ai denotes a cofactor and is equal to Aj; with the sign that 
would be attached in expanding the determinant. 

129. $\nu'_p$ denotes a moment coefficient about the origin.

130. $\nu_p$ denotes a moment coefficient about the mean.

131. $\mu_1$, $\mu_2$, $\mu_3$, and $\mu_4$ denote adjusted moment coefficients about
the mean.

132. $\beta_1$ and $\beta_2$ denote functions of the adjusted moment coefficients
and are used in curve-fitting.

133. $\kappa_1$ and $\kappa_2$ denote functions of $\beta_1$ and $\beta_2$ and are used in
curve-fitting.
SELECTED BOOKS FOR SUPPLEMENTARY READING


A. TEXTS ON EDUCATIONAL STATISTICS:
1. Statistical Methods Applied to Educational Problems, by Harold
O. Rugg. Houghton Mifflin Company, 1917.
A very readable book on elementary methods.
2. Statistics in Education and Psychology, by Henry E. Garrett.
Longmans, Green & Co., 1926.
A good discussion of reliability and partial correlation.
3. Fundamentals of Statistics, by L. L. Thurstone. The Macmillan
Company, 1925.
A clear presentation of elementary methods.
4. Statistical Method in Educational Measurement, by Arthur S.
Otis. World Book Company, 1925.
Contains a full treatment of percentile curves.
5. Statistical Method, by T. L. Kelley. The Macmillan Company,
1923.
An advanced book including many important formulas.
6. Essentials of Mental Measurement, by W. Brown and G. Thomson.
Cambridge University Press, London, 1921.
Discusses psychophysical methods and the Spearman two-factor
theory.
7. Graphic Methods in Education, by J. H. Williams. Houghton
Mifflin Company, 1924.
Shows how to prepare charts and diagrams.


B. GENERAL TEXTS:
1. Introduction to the Theory of Statistics, by G. Yule. Charles
Griffin, London, 1926.
The best general text, but somewhat difficult for beginners.
2. First Course in Statistics, by D. C. Jones. G. Bell, London.
A clearly written text. Contains a good discussion of frequency
curve-fitting.
3. Handbook of Mathematical Statistics, by H. L. Rietz and others.
Houghton Mifflin Company, 1924.
A useful reference book.
4. Mathematical Analysis of Statistics, by C. H. Forsyth. John
Wiley & Sons, 1924.
Clear treatment of interpolation. Suitable for students with
mathematical training.
5. Mathematical Theory of Probabilities, by Arne Fisher. The
Macmillan Company, 1922.
A careful development of the theory of probability and applications
to statistical problems. For advanced students.
6. Frequency Curves and Correlation, by W. P. Elderton. C. and
E. Layton, London, 1927.
A good exposition of Pearson's system of frequency curve-fitting.
7. Calculus of Observations, by E. T. Whittaker and G. Robinson.
D. Van Nostrand Company, 1924.
An excellent text for the advanced mathematical student.
8. Mathematical Statistics, by Henry Lewis Rietz. The Open
Court Publishing Company, Chicago, 1927.
A concise, clear, and excellent monograph. Especially recommended
for students who have had calculus.


. TEXTS IN'OTHER FIELDS: 
1. Medical Biometry and Statistics, by Raymond Pearl. W. B. 
Saunders Company, 1923. 
A clearly written text for students of medicine and public health. 


2. Statistical Methods, by Frederick C. Mills. Henry Holt and 
Company, 1924. 
One of the best books in the field of economics. 


3. Elements of Statistics, by A. L. Bowley. P. S. King, London,
1920.
An advanced book on economic statistics, by the most authoritative
writer.


D. AIDS IN CALCULATION:
1. Tables for Statisticians and Biometricians, edited by Karl 
Pearson. Cambridge University Press, London, 1924. 


New edition forthcoming. 
The best tables for advanced work. 
2. Tables of √(1 − r²) and 1 − r², by J. R. Miner. The Johns Hopkins
Press, 1922.
Every student with access to a calculation machine should have these 
tables. 


3. Barlow’s Tables of Squares, etc. (1-10,000). E. and F. Spon,
London. (May be obtained at The University of Chicago
Bookstore.)
The classical handbook. 


4. Tables of Applied Mathematics in Statistics, by J. W. Glover.
George Wahr, Ann Arbor, Michigan, 1924.
A valuable aid for the actuary and advanced student. 


5. Statistical Tables for Students in Education and Psychology,
by Karl J. Holzinger. The University of Chicago Press, 1925.
Adapted for classroom use. 


6. Probable Errors of the Correlation Coefficient, by Karl J. Holzinger.
Cambridge University Press, London, 1925.
Four-place values with proportional parts. 


7. Chambers’s Mathematical Tables. W. R. Chambers, London,
1921.
Contains seven-place logarithm tables.


8. Five-Place Logarithmic and Trigonometric Tables, by James M.
Taylor. Ginn and Company, 1905.
A clearly printed and convenient set of tables. 


INDEX 


Age, corrective formula for eliminating,
185 f.

Analysis of classified data, 7 

Area of normal curve, 209 f. 

Arithmetical mean, 48; calculation of, 
79 ff.; properties of, 83 f.; reliability 
of, 85 

Arithmetical progression, 47 

Attenuation, Spearman’s correction for,
253

Averages, 78 ff.; method of, in curve- 
fitting, 320 f., 325 ff. See also Arith- 
metical mean, Geometrical mean, 
Harmonic mean, Median, Mode 

Ayres, Leonard P., 10, 26 


Bar diagrams, 38 f. 

Binomial distribution, 190 ff. 

Binomial law, experimental verification 
of, 199 f. 

Biserial r, 271 ff. 

Blakeman’s test for linearity, 183, 267 

Burgess, William R., 11 f., 299 

Burt, Cyril, 305 ff. 


Calculation, of statistical constants, 7; 
errors in, 65 ff. 

Card, data, 20 

Central tendency, variations in, 79 

Characters, in statistical series, 12 f.; 
ordered and unordered, 13; continu- 
ous and discontinuous, 14; classes of, 
22; static and dynamic, 75; methods 
of correlation for two, 256 ff. 

Chi-Square Test, Pearson’s, 245 ff. 

Class limits, 28 f. 

Class values, 24, 80; percentile rank of, 
138 f. 

Classification of data, 9 ff. 

Classifier, 25 ff. 

Coefficient, of variation, 116 ff.; product-
moment correlation, 143 ff.; regres-
sion, 159, 161; reliability, 168 ff.;
validity, 168; probable error of, of
variation, 238; probable error of cor-
relation, 238; of contingency, 278 ff.

Cofactor in a determinant, 312 

Collection, units of, 11 f. 

Column diagram, 36 f. 

Combinations, 191 

Comparable measurements, 118 ff. 

Compensating errors, 66 

Constants, statistical, 7 

Contingency, coefficient of, 273 ff. 

Coordinates, 40 ff. 

Correlation, linear, 141 ff.; Spearman’s
theorem on, 168; Spearman-Brown
prophecy formula, 169 f.; effect of se-
lection upon, 172; non-linear, 177 ff.;
methods of, for two characters, 256 ff.;
partial, 283 ff.; multiple, 307 ff.

Correlation coefficient, product-moment, 
148 ff.; computation of, 146 ff.; in- 
terpretation of, 163 ff.; probable er- 
ror of, 238 

Correlation ratio, 177 f.; probable er-
ror of, 239; for qualitative and un-
ordered series, 266 f.

Courses in experimental and statistical 
method, 2 


Crude mode, 90 f.


Cumulative frequency curve, 129 ff. 

Cumulative frequency distribution, for 
Otis Test, 28 

Curve-fitting, elements of, 317 ff. 

Curves, 44; normal probability, 44 f.,
204 ff.; types of, 318 f.; fitting nor-
mal, by method of moments, 214 ff.,
342 ff.; criteria and constants for nor-
mal, 343


Data, in statistical investigation, 3 f.;
collection and analysis of, 6 f.; col-
lection and classification of, 9 ff.;
primary and secondary, 9 f.; methods
of collecting, 14 ff.; arrangement of,
19 ff.; range of, 22; tabular and
graphical presentation of, 31 ff.; cal-
culation of mean for ungrouped, 79;
fitting normal curve to frequency dis-
tribution of, 214 ff.; representing, on
a normal scale, 221 ff.

Data card, 20 

Determinants, solution by, 312 ff. 

Deviates of normal curve, 209 f. 

Deviation, mean, 102 ff.; standard, 
108 ff.; quartile, 110 ff. 

Diagrams, presentation of results in,
7 f.; purpose of, 31 f.; column and
bar, 36 ff.; scatter, 141 f.

Discontinuous series, 14 

Dispersion, variations in, 79; measures 
of, 101 ff.; comparison of measures 
of, 118 ff. 

Distribution, probable errors of certain 
constants for a normal, 237 ff. See 
also Binomial distribution, Cumula- 
tive frequency distribution, Simple 
frequency distribution 


Efficiency, teaching, 15 

Enumeration in problems, 14 

Errors, absolute and relative, 65 f.;
biased and unbiased, 66 ff.; in edu-
cational measurement, 74 ff.; re-
sponse, 75; of estimate, 161 f.;
sampling, in the mean, 232 ff.; prob- 
able, of the difference between two 
means, 235 ff. 

Estimate, standard error of, 159, 161 f. 

Estimation, in experimental work, 15; 
of teaching efficiency, 15 

Exponents, laws of, 51 f. 


Free-hand method in curve-fitting, 320, 
323 ff. 

Frequencies, probable errors of observed 
and percentage, 248 ff. 

Frequency distribution. See Cumula- 
tive frequency distribution, Simple 
frequency distribution 

Frequency polygon, 37 f. 

Frequency table, computation of cor- 
relation coefficient for, 149 ff. 

Function, hyperbola, 318; logarithmic 
growth, 318; nth-order parabola, 319 

Functional relationships, 42 f. 


Geometrical mean, 49; and geometrical 
series, 91 ff. 
Geometrical progression, 48, 91 ff. 
Grouping, correction for broad, 263 ff.;
correction for fineness of, 269 f. 


Harmonic mean, 95 f. 
Heteroscedasticity, 186 
Histogram, 37 
Hollerith Machine, 20 f. 


Imagination, constructive, as requisite, 3 

Indexing, numerical or verbal mode of, 13 

Intelligence and attitude, Tulchin’s data 
on, 268 f. 

Intercorrelations, 288 

Interpolation, 56 ff. 

Interpretation of results, 7 


Kelley’s formula for adjusting reliability 
coefficients, 254 
Kurtosis, variations in, 79 


Law of Statistical Regularity for Large 
Numbers, 16 ff. 

Least squares, method of, 321 ff., 329 ff. 

Line, graph of straight, 43 f. 

Linearity, tests for, 183 f., 267 

Logarithms, 47 ff.; invention of, 49 f.;
laws of, 52; Briggs system of, 53;
four-place table of, 60 f.; use with
rounded numbers, 71 ff.


McCall’s method of scaling, 226 f.
Median, 27, 85 ff.; probable error of, 238
Mode, 90 f.

Moments, 338 ff.
Multiple-response scoring formula, 171 


Nomography, 31 
Normal probability curve. See Prob-
ability curve


Ogive curve, 132 

Ordinates of normal curve, 209 f. 

Otis Test Scores, frequency distribution 
of, 25; classifier for, 26 f.; cumulative 
frequency distribution for, 28; histo- 
gram of, 38; illustrating the median 
for, 86 


Partial correlations, 283 ff.; of first- 
order, 287; of second-order, 287 

Pearson’s correction formula, for broad
grouping, 263 ff.; for Spearman’s rank
coefficient, 279 f.




Pearson’s formula for product-moment 
correlation, 259 f. 

Pearson’s Tables, probable error of V 
with, 238 

Percentage frequency, probable error 
of, 243 

Percentile curves, 131 ff. 

Percentile method, 127 ff. 

Percentile ranks, 136 ff. 

Percentiles, definition of, 127; com-
putation of, 128 ff.

Permutations, 191

Planning of calculations, 7

Point binomial, 196; mean and stand-
ard deviation of, 197 f.; comparison
of, and normal curve, 212 ff.

Predictive value of a test, 168

Primary records, tabulation for, 6

Probability, elementary, 192 ff.

Probability curve, normal, 44 f., 204 ff.;
equation of, 207 f.; area, ordinates,
and deviates of, 209 ff.

Probable error, 211; of the difference
between two means, 235 ff.; of cer-
tain constants for normal distribution,
237 ff.; applications of formulas of,
240 ff.; of observed and percentage
frequencies, 243 ff.; of an observed
proportion, 248 ff.; of biserial r, 273;
of contingency coefficient, 278; of
correlation from ranks, 280; of β₁ and
β₂, 344

Problem, planning study of, 5

Product-moment correlation coefficient,
148 ff., 258 ff.

Professional schools and statistical
method, 2

Progressions, arithmetical and geomet-
rical, 47 f.

Proportion, probable error of, 248 ff.;
standard error of, 248


Quadrants, 40 f. 

Qualitative series, 14, 256 ff., 266 ff. 

Quantitative series, 14, 256 ff. 

Quartile deviation, 110 ff. 

Quartiles, measure of skewness based 
on, 122 

Questionnaires, 15 


Ranks, correlation from, 278 ff. 
Records, tabulation for primary, 6 
Rectification, 323 ff. 
Regression, lines of, 154 ff.; meaning
of, 163; probable errors of coefficients
of, 239; probable error of higher- 
order coefficient of, 289; equation 
for, 292 ff.; coefficient of, of third- 
order, 315 

Reliability, coefficient of, 168 ff.;
Spearman-Brown formula for pre-
dicting, 169 f.; Kelley’s formula for
adjusting coefficient of, 254

Report, writing of, 8 

Residuals, 158, 320 

Response error, formulas for, 250 ff. 

Results, interpretation of, 7; presenta- 
tion of, 7 f. 

Rounded numbers, arithmetical com- 
putation with, 69 ff. 


Sample, random, 18 f. 

Sampling, 16 ff. 

Scaling of test questions, 224 ff.

Scores, standard, 168 f. 

Selection, effect of, upon correlation, 172 

Series, types of, 12 ff.; quantitative and 
qualitative, 14, 256 ff.; classification 
of, 15; correlation ratio for qualita- 
tive and unordered, 266 ff. 

Sheppard’s corrections, 341 

Significant figures, 68 

Simple frequency distribution, 22 ff.; 
for Otis Test Scores, 25; calculation 
of mean from, 79 ff. 

Skewness, variations in, 79; measure- 
ment of, 122 f. 

Sorting by mechanical devices, 20 f. 

Source material, secondary, 10 f. 

Spearman’s correction for attenuation,
253

Spearman’s formula based on rank dif- 
ferences, 278 

Spearman’s theorem on correlation, 168 

Standard deviation, 108 ff.; of point 
binomial, 198; probable error of, 238 

Standard error of proportion, 248 

Standard scores, 168; in terms of 
‘true’ scores and response error, 251 f. 

Standardized tests, use of, 2 

Stanford-Binet Tests, 168 

Statistical method, need for, 1 f.; gen-
eral requirements for, 3 ff.; procedure 
in dealing with problem, 5 ff.; ac- 
curacy in, 65 

Statistician, capacity required for, 4 f. 
Tables, presentation of results in, 7 f.;
purpose of, 31 f.; construction of, 32 ff. 

Tabulation, for primary records, 6; of 
records, 14; by mechanical devices, 
20 f. 

Tallying, 22 

Terman Group Intelligence Tests,
120

Test units, lack of equivalence of, 74

Tests, uses of correlation in evaluating,
167 ff.; validity of, 168; scaling of
questions in, 224 ff.

Transmutation formula for comparable
scores, 121




“True” scores, standard scores in terms
of, 251 f. 


Validity, 168 

Variability, measures of, 101; absolute 
and relative, 117 

Variables, independent and dependent, 
42; method of eliminating effect of, 
184 ff.; partial correlation for three, 
284; partial regression equations for 
four, 300 ff. 

Variation, coefficient of, 116 ff. 


Yule, G. U., 15, 275 f. 